Solved: Intel Xeon E5-2650: discontinuous memory access on matrix elements leads to fewer L3 data cache reads BUT more L3 data cache misses

Peter_Johnson · ‎05-29-2020

I've noticed an issue when counting L3 cache reads and misses using PAPI on an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.3 GHz.

Assume the matrix is stored in row-major order.

First access pattern: if we access (just read) an 8192 x 8192 matrix contiguously, we observe ~4,500,000 L3 Data Cache Read and ~1,100,000 L3 Data Cache Miss events.

Second access pattern: instead of reading all columns of a row before moving to the next row (contiguous access), we read the first 1024 columns of the first row, then immediately the first 1024 columns of the second row, and so forth. Once we have read the first 1024 columns of every row, we read columns 1024-2047 row by row, and continue until the memory footprint covers all matrix elements.

Under the second access pattern, we observe ~4,000,000 L3 Data Cache Read events (slightly fewer) but ~1,600,000 L3 Data Cache Miss events (many more than with contiguous access).
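The two traversal orders described above can be sketched as index generators. This is only an illustration, not the original benchmark: the function names are hypothetical, and the 4-byte element size used in the offset arithmetic is an assumption, since the post does not state the element type.

```python
def contiguous_order(n_rows, n_cols):
    """Row-major order: finish every column of a row before the next row."""
    for r in range(n_rows):
        for c in range(n_cols):
            yield r, c


def blocked_order(n_rows, n_cols, block_cols):
    """Read block_cols columns of each row, for all rows, then the next column block."""
    for c0 in range(0, n_cols, block_cols):
        for r in range(n_rows):
            for c in range(c0, min(c0 + block_cols, n_cols)):
                yield r, c


def byte_offset(row, col, n_cols, elem_bytes=4):
    """Offset of element (row, col) in a row-major array (elem_bytes is an assumption)."""
    return (row * n_cols + col) * elem_bytes


# Tiny demo of the blocked order on a 2 x 4 matrix with 2-column blocks.
print(list(blocked_order(2, 4, 2)))
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]

# For the 8192-column matrix: bytes covered by one 1024-column segment,
# and the distance between the starts of consecutive rows.
segment_bytes = byte_offset(0, 1024, 8192) - byte_offset(0, 0, 8192)
row_jump_bytes = byte_offset(1, 0, 8192) - byte_offset(0, 0, 8192)
print(segment_bytes, row_jump_bytes)  # 4096 32768
```

Under these assumptions both orders read every element exactly once. They differ only in order: the blocked pattern reads 4096 contiguous bytes and then jumps 32768 bytes to the same column block of the next row.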

As I understand it, an L3 data cache miss leads to a read from DRAM. Since there is no data reuse and no matrix element is accessed twice, I would not expect a significant difference in L3 cache misses between the two patterns, so I am confused by the PAPI results. This trend also holds for other block sizes and matrix sizes.

Is there a semi-quantitative way to understand this phenomenon?

Thank you very much!

McCalpinJohn · ‎06-02-2020

The first thing to figure out is how many accesses and misses are expected:

For 32-bit array elements, the array is 256 MiB, or 4,194,304 cache lines (about 4.2M).
This is just over 10x the size of the L3 cache, so one expects little reuse -- first guess is 4.2M accesses and 4.2M misses.
The count of 4.5M observed accesses is roughly 7% higher than expected.
- It is very common to see ~3% extra traffic (mostly for TLB walks), but an excess this large may indicate that something else is going on.
The count of 1.1M L3 misses is only about 1/4 of the expected value.
- This suggests that the L3 miss event is not counting hardware prefetches of data into the L3.
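The expected counts can be checked with a few lines of arithmetic. This sketch assumes 4-byte elements, 64-byte cache lines, and a 25 MiB L3 (the size I believe applies to the E5-2650 v3; treat it as an assumption). The observed counts are the approximate figures quoted in the question.

```python
ROWS = COLS = 8192
ELEM_BYTES = 4                   # assumption: 32-bit elements
LINE_BYTES = 64                  # cache line size
L3_BYTES = 25 * 1024 * 1024      # assumption: 25 MiB L3 on this part

array_bytes = ROWS * COLS * ELEM_BYTES
expected_lines = array_bytes // LINE_BYTES


def ratio_to_expected(observed):
    """Observed event count as a fraction of the cache lines in the array."""
    return observed / expected_lines


print(array_bytes // 2**20)   # 256 (MiB)
print(expected_lines)         # 4194304

# Approximate counts reported in the question for the two access patterns.
for label, reads, misses in [("contiguous", 4_500_000, 1_100_000),
                             ("blocked", 4_000_000, 1_600_000)]:
    print(label,
          f"reads/expected={ratio_to_expected(reads):.2f}",
          f"misses/expected={ratio_to_expected(misses):.2f}")
```

With these assumptions the array is about 10.2 times the L3 size, and the reported misses cover only roughly a quarter (contiguous) to under 40% (blocked) of the cache lines in the array.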

The second thing to figure out is exactly which performance counter events are being used.

PAPI has a number of ways to select performance counter events, some of which depend on definitions baked into the "perf events" subsystem of the kernel.
- If I recall correctly, the "generic" L3 reference and miss events chosen by "perf events" are actually a function of the kernel version, and are incorrect for some Intel processor families in some kernels.
The most common L3 cache miss event is EventSelect 0x2E "LONGEST_LAT_CACHE", with sub-events "REFERENCES" (Umask 0x4F) and "MISS" (Umask 0x41).
- This is an "architectural" event whose meaning has changed slightly over the years as Intel processor implementations have changed. On Xeon E5 v3, this event does *not* count L3 accesses by the Hardware Prefetchers.
- On a Xeon E5-2680 v3, testing with the STREAM benchmark showed that with HW prefetching enabled, the event LONGEST_LAT_CACHE.REFERENCES only recorded about 82% of the expected value and LONGEST_LAT_CACHE.MISS was about 31% of the observed references (and 25% of the expected references).
- With HW prefetching disabled, both references and misses matched expected values (plus about 1%).
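For readers who want to program these events directly, the EventSelect and Umask values above can be combined into the raw-event string that the Linux `perf` tool accepts (`rUUEE`, umask byte followed by event byte). This is a small sketch assuming that syntax. The helper name is hypothetical, and it ignores the other control fields (cmask, edge, invert).

```python
def perf_raw_event(event_select, umask):
    """Build a perf raw-event string from an EventSelect and Umask byte."""
    return f"r{umask:02x}{event_select:02x}"


LONGEST_LAT_CACHE = 0x2E
references = perf_raw_event(LONGEST_LAT_CACHE, 0x4F)
misses = perf_raw_event(LONGEST_LAT_CACHE, 0x41)
print(references, misses)  # r4f2e r412e

# Consistency check of the STREAM figures quoted above:
# 82% of expected references, 31% of those counted as misses.
print(round(0.82 * 0.31, 2))  # 0.25
```

The resulting strings could then be passed to something like `perf stat -e r4f2e,r412e ./your_benchmark`, where `./your_benchmark` is a placeholder for the program under test.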

So your results are mostly consistent with the LONGEST_LAT_CACHE event on Xeon E5 v3. Because of the specific behavior of that event, the count of "misses" can be difficult to interpret. For an array 10x larger than the L3 cache, effectively all of the data is coming from DRAM. What has changed in your case is how often the HW prefetchers get the data into the L3 before the corresponding load arrives. Higher "misses" sounds worse, but this may or may not correlate with execution time. Why not? There are a number of possible reasons:

The count of "misses" does not indicate how long these loads had to wait for the data.
- If the load arrived very shortly before the data from the corresponding HW prefetch arrived, the net delay could be negligible.
- This can be investigated using the CYCLE_ACTIVITY event, which can be configured to count the number of "stall cycles" (no micro-ops dispatched) during which there is a pending load that has missed the L2 cache.
The L2 HW prefetchers are also prefetching data into the L2 caches. If that part of the prefetch is more effective, then performance (execution time) could certainly improve -- independent of L3 misses.

Reduction in execution time is the goal, so that should be included in any results.

The performance counters in the memory controllers are documented in the "Xeon E5 and E7 v3 Family Uncore Performance Monitoring Reference Manual", which can be obtained from the Intel SW Developer's manual page (https://www.intel.com/content/www/us/en/develop/articles/intel-sdm.html). In my experience, these counters are accurate in every generation and family of Intel processors. The processor also contains counters in the L3 slices, but these can be much more difficult to interpret.

PAPI may be able to access these "uncore" counters if the OS revision supports them.

If "/sys/bus/event_source/devices/" has subdirectories named "uncore_imc_*", then you have basic support for the uncore memory controller counters.
If "/sys/bus/event_source/devices/uncore_imc_*/events" subdirectories include files "cas_count_read", "cas_count_write", etc., then these events are available by name, and there is a decent chance that PAPI will be able to access them.
I don't use PAPI, so I can't help with the "how" -- I write all my own performance monitoring utilities so that I can understand what they are doing....
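The two sysfs checks described above can be automated with a short script. This is a minimal sketch: the function name and the `root` parameter are hypothetical conveniences, and it only reports whether the named event files exist, not whether PAPI can actually use them.

```python
from pathlib import Path


def imc_event_support(root="/sys/bus/event_source/devices"):
    """Map each uncore_imc_* device to the presence of its CAS count events."""
    root_path = Path(root)
    result = {}
    if not root_path.is_dir():
        return result  # no perf event sources exposed at this path
    for dev in sorted(root_path.glob("uncore_imc_*")):
        events_dir = dev / "events"
        names = {p.name for p in events_dir.iterdir()} if events_dir.is_dir() else set()
        result[dev.name] = {
            "cas_count_read": "cas_count_read" in names,
            "cas_count_write": "cas_count_write" in names,
        }
    return result


support = imc_event_support()
if not support:
    print("no uncore_imc_* devices found")
for device, events in support.items():
    print(device, events)
```

An empty result means the kernel does not expose the memory controller counters at that path. A device listed with both entries set to `True` has the events available by name.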

HadiBrais · ‎08-07-2020

@McCalpinJohn Regarding the native performance events the PAPI events are mapped to, the "L3 Data Cache Read" and "L3 Data Cache Miss" PAPI events are mapped to OFFCORE_REQUESTS.DEMAND_DATA_RD and MEM_LOAD_UOPS_RETIRED.L3_MISS, respectively. PAPI does have an event that is mapped to LONGEST_LAT_CACHE.REFERENCES, but it's a different event.

The event MEM_LOAD_UOPS_RETIRED.L3_MISS may significantly undercount according to erratum HSE114. This may explain (perhaps only partially) the difference in L3 load miss count between the two access patterns.

I don't see why the streaming prefetcher would behave differently for the two access patterns, because both access full 4KB pages sequentially (as far as I can tell from the description provided by the OP). I expect the load prefetching pattern to be mostly the same in both cases, but I could be wrong.

@Peter_Johnson:

How much variation is there in the event counts across multiple runs?
What happens if all hardware prefetchers are turned off?
What is the size of an element? It'd be useful to show the assembly code of the loop.

Peter_Johnson · ‎06-02-2020

Thank you, Dr. Bandwidth!

I see your points. Due to the existence of HW prefetchers, the LLC cache misses could be much fewer than what is provided by PAPI.

May I ask a follow-up question? You wrote "The L2 HW prefetchers are also prefetching data into the L2 caches", but you also mentioned "hardware prefetches of data into the L3". Does this mean that the Intel L2 HW prefetchers prefetch data into both the L2 and L3 caches?

Thanks again!

McCalpinJohn · ‎06-02-2020

The nomenclature is sometimes irritatingly inconsistent, but I think the right way to describe it is:

There are two HW prefetch engines in the L2 cache -- the "streamer" and the "adjacent line fetcher".
The "streaming" prefetcher is almost always the important one.
When the "L2 streamer" sees two *accesses* into a 4KiB naturally-aligned region (*accesses*, not *misses*), it computes a stride and initiates two HW prefetches along that stride (provided that the projected addresses remain inside the 4KiB region). These are (initially) "prefetch to L2" transactions. On a processor with an inclusive L3, the data must be put in the L3 as well.
When the "L2 streamer" sees more *accesses* into that 4KiB region, it considers strides in both ascending and descending directions, and issues more prefetches.
When the L2 cache gets "too busy" (perhaps determined by the number of outstanding L2 misses?), the prefetcher switches from generating "prefetch to L2" operations and instead generates "prefetch to L3" operations.
- "Prefetch to L3" is disabled by default on SKX/CLX processors because the L3 is relatively small, the L2 is relatively large, and there is no need for inclusion. We tested enabling "prefetch to L3" with a BIOS option on some of our Xeon Platinum 8160 processors, and found no significant net impact -- some things a little slower, some a little faster, not worth worrying about.
- These operations coming out of the L2 HW prefetcher are sometimes referred to as "L3 prefetches" or "LLC prefetches". That is mostly correct, but it can be confusing if one interprets it as meaning that there are additional HW prefetch units associated with the L3 cache slices.
- BTW: It makes no sense to build a prefetcher in the L3 cache because each L3 cache slice only sees references for the addresses that are hashed to it -- e.g., 1/28th of the addresses in a processor with 28 L3 slices -- so there is no hope of seeing spatial locality at the L3 or memory controller.
- Because there is (by default) no "prefetch to L3" happening, I would guess that the SKX/CLX processors maintain the "prefetch to L2" for more (or all?) of each 4KiB naturally-aligned region than earlier processors, but I have not yet tested this systematically.
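The first step of the streamer description above (two accesses in a 4KiB region, a computed stride, two prefetches that must stay inside the region) can be written as a toy model. This is only an illustration of the prose, not a description of the actual hardware: the function name, the fixed 64-byte line size, and the decision to stop at the first target outside the region are all assumptions.

```python
LINE_BYTES = 64      # assumption: 64-byte cache lines
REGION_BYTES = 4096  # 4KiB naturally-aligned region tracked by the streamer


def toy_streamer_prefetches(first_addr, second_addr, n_prefetches=2):
    """Return prefetch addresses a toy streamer would issue after two accesses."""
    region = second_addr // REGION_BYTES
    if first_addr // REGION_BYTES != region:
        return []  # the two accesses fall in different 4KiB regions
    first_line = first_addr // LINE_BYTES
    second_line = second_addr // LINE_BYTES
    stride = second_line - first_line  # in cache lines, may be negative
    if stride == 0:
        return []  # same line twice: no direction to follow
    targets = []
    for k in range(1, n_prefetches + 1):
        addr = (second_line + k * stride) * LINE_BYTES
        if addr // REGION_BYTES != region:
            break  # projected address leaves the 4KiB region
        targets.append(addr)
    return targets


print(toy_streamer_prefetches(0, 64))       # [128, 192]
print(toy_streamer_prefetches(3904, 3968))  # [4032]
print(toy_streamer_prefetches(256, 192))    # [128, 64]
```

In this toy model the prefetcher has to start over in every new 4KiB region. That is one reason the order in which regions are visited could matter for how much data arrives ahead of the demand loads.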

davcole · ‎06-24-2020

Thank you, I had the same problem!

Peter_Johnson · ‎06-02-2020

Thanks so much, Dr. Bandwidth! You are sooo cool!

All clear now. Thanks again!
