Prefetching and memory accesses

Bo_W_4 · 11-10-2016

Hello,

I'm trying to understand the relationship between memory accesses, LLC misses, and prefetching.

I expected: memory accesses = L3 misses + L2 prefetches that miss in both L2 and L3.

I tried to confirm this expectation with performance counters on Broadwell, using the STREAM benchmark (1 thread), and got the following values:

MEM_LOAD_UOPS_RETIRED_L3_MISS 3.22E+08

L2_RQSTS_L2_PF_MISS 1.36E+09

L2_RQSTS_L2_PF_HIT 3.89E+08

L2_RQSTS_ALL_PF 1.76E+09

Memory access measured using iMC events 1.51E+09

I cannot make the equation above hold with these values: 1.51E+09 < 1.36E+09 + 3.22E+08. The measured memory accesses are about 10% fewer than the requests issued by the core.
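The comparison can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative labels, and the values are the counter readings listed above.

```python
# Counter readings copied from this post (STREAM, 1 thread, Broadwell).
l3_load_miss = 3.22e8   # MEM_LOAD_UOPS_RETIRED_L3_MISS
l2_pf_miss = 1.36e9     # L2_RQSTS_L2_PF_MISS
l2_pf_hit = 3.89e8      # L2_RQSTS_L2_PF_HIT
l2_pf_all = 1.76e9      # L2_RQSTS_ALL_PF
imc_accesses = 1.51e9   # memory accesses measured with iMC events

# Memory accesses predicted by the assumed equation.
expected = l3_load_miss + l2_pf_miss

# Relative gap between the core-side estimate and the iMC measurement.
gap = (expected - imc_accesses) / expected

print(f"expected = {expected:.3e}")
print(f"measured = {imc_accesses:.3e}")
print(f"measured is {gap:.1%} below expected")

# Consistency check: prefetch hits plus misses should be close to all prefetches.
pf_sum = l2_pf_hit + l2_pf_miss
print(f"PF_HIT + PF_MISS = {pf_sum:.3e} vs ALL_PF = {l2_pf_all:.3e}")
```

The two L2 prefetch sub-events add up to roughly the ALL_PF count, so the mismatch is between the core-side sum and the iMC measurement, not among the L2 events themselves.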

What is wrong in my calculation?

I also have some more questions about prefetching:

1. Is there an L3 prefetcher? I have only read about the L1D and L2 prefetchers. However, in

https://download.01.org/perfmon/BDW-DE/BroadwellDE_core_V5.json

the description of L2_RQSTS.L2_PF_HIT (an L2 prefetch that hits in the L2 cache) includes the phrase "L3 prefetch new types." What does that mean?

2. If an L2 prefetch does not hit in the L2 cache, does it fetch from L3, or directly from memory?

3. If an L2 prefetch fetches from L3, is it counted among the usual L3 accesses?

4. Does each prefetch fetch only one cache line?

Best,

Bo

McCalpinJohn · 11-10-2016

(1) There is not (as far as I know) an "L3 prefetcher", but there are two different types of prefetches that can be generated by the L2 HW prefetcher -- prefetches into the L2 cache or prefetches into the L3 cache. Some documents appear to refer to the latter as "L3 prefetches". The L2 HW prefetcher will generate one kind of prefetch or the other depending on how busy it is and other factors that have not been disclosed in detail. For Xeon E5 processors, the Uncore Performance Monitoring Reference Manuals contain some relevant information, but it is mostly implicit rather than explicit.

I tried to make sense of the L2 HW prefetch behavior by running a set of experiments that loaded one line per 4KiB page, two lines per 4KiB page, three lines per 4KiB page, etc., but the results were never clean enough to make sense.
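The access pattern of such an experiment can be sketched in Python as a list of byte offsets to load. This only illustrates the pattern and is not the code used for the experiments: the function name is hypothetical, a 64-byte cache line is assumed, and the choice of the first consecutive lines of each page is also an assumption, since the post does not say which lines in a page were loaded.

```python
PAGE_SIZE = 4096  # 4 KiB page, as in the experiments described above
LINE_SIZE = 64    # assumed cache-line size in bytes


def touched_offsets(n_pages, lines_per_page):
    """Return byte offsets of the cache lines to load.

    Assumption: the first `lines_per_page` consecutive lines of each
    4 KiB page are the ones touched.
    """
    if not 1 <= lines_per_page <= PAGE_SIZE // LINE_SIZE:
        raise ValueError("lines_per_page must be between 1 and 64")
    return [page * PAGE_SIZE + line * LINE_SIZE
            for page in range(n_pages)
            for line in range(lines_per_page)]


# Two lines per page over three pages.
print(touched_offsets(3, 2))
```

A real measurement would walk a buffer with these offsets from compiled code while reading the prefetch counters; the sketch only fixes which lines are touched.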

(2) L2 prefetches are supposed to miss in the L2 (otherwise they are just overhead). When an L2 prefetch misses in the L2 it will definitely look in the L3 for the data. If the prefetch misses in the L3 cache, it will go to memory for the data. If the prefetch is a "prefetch into L3" and it hits in the L3, that is probably treated as a no-op, but there could be side effects (such as updating the prefetch stream "confidence value" that most HW prefetch implementations include).
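The lookup order described in (2) can be written as a toy decision function. This is a simplified model of the description above, not of the actual hardware: the function name and return strings are hypothetical, side effects such as confidence updates are ignored, and treating an L2 hit the same way for both prefetch types is an assumption.

```python
def l2_hw_prefetch_outcome(target, in_l2, in_l3):
    """Toy model of where one L2 HW prefetch ends up.

    target: "L2" (prefetch into L2) or "L3" (prefetch into L3).
    in_l2, in_l3: whether the line is already present in that cache.
    """
    if target not in ("L2", "L3"):
        raise ValueError("target must be 'L2' or 'L3'")
    if in_l2:
        # Assumption: an L2 hit is pure overhead for either prefetch type.
        return "hit in L2: overhead only"
    if in_l3:
        if target == "L3":
            return "hit in L3: probably a no-op"
        return "hit in L3: line copied into L2"
    return "miss in L3: line fetched from memory"


for target in ("L2", "L3"):
    for in_l3 in (True, False):
        print(target, in_l3, "->", l2_hw_prefetch_outcome(target, False, in_l3))
```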

(3) The phrase "usual L3 accesses" does not mean anything. If an L2 HW "Prefetch to L2" transaction hits in the L3 cache, it will count as an L3 access and it will count as an L2 HW Prefetch hit in the L3 cache (e.g., using the OFFCORE_RESPONSE counters or using the CBo performance counters in the uncore).

(4) Yes, each HW prefetch operation (counted by any of the various performance counters) will result in copying one cache line.
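Because each counted prefetch moves one cache line, a prefetch count converts directly to a data volume. A minimal sketch, assuming a 64-byte cache line; the function name is hypothetical and the input is the L2_RQSTS_L2_PF_MISS value from the first post in this thread.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)


def prefetch_bytes(prefetch_count):
    """Bytes moved by `prefetch_count` prefetches at one cache line each."""
    return prefetch_count * LINE_SIZE


# L2_RQSTS_L2_PF_MISS = 1.36E+09 from the first post.
count = 1_360_000_000
print(prefetch_bytes(count) / 1e9, "GB")
```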

Bo_W_4 · 11-23-2016

Thanks for your response. It helps a lot.

Now I'm trying to analyse L2 prefetching based on OFFCORE_RESPONSE events. However, the Developer's Manual, Volume 3, Chapter 18.12, does not give much information about the 5th-generation (Broadwell) processors. Can I use the offcore events introduced for Haswell?

I did some measurements on Broadwell, again with the STREAM benchmark. All prefetching values were 0.

McCalpinJohn · 11-24-2016

The counter events change from generation to generation, and events that worked OK in one generation could be completely broken in the next. Lots of events that were documented on Sandy Bridge/Ivy Bridge were dropped from Haswell -- possibly because they never worked properly on Sandy Bridge/Ivy Bridge. I have not looked at any of the hardware performance counters on Broadwell yet.

Bo_W_4 · 11-25-2016

I have another question. Which operations can issue memory accesses, especially reads? I can think of L2 prefetches whose data cannot be found in the L3 cache, and L3 cache misses. Anything else?

Using OFFCORE events and L3 events, I can get an equation that holds for the STREAM benchmark:

OFFCORE:ANY_DATA:ANY_RESPONSE:ANY_SNP = IMC_READ.

However, this equation is not valid for the SPEC OMP benchmark (363.SWIM).

I understand that SWIM does more complicated calculations than STREAM, and I have tried a lot of other OFFCORE combinations. Still, I cannot get an equation that holds. Do you have any ideas?

Best,

Bo