What's the trigger condition of the L1 stride prefetcher?

JasperMa · ‎02-15-2021

Intel's "Intel hardware Prefetcher" web page shows that there are four kinds of hardware prefetchers. The prefetcher controlled by bit 3 is the L1 stride prefetcher. I wrote some test code to find out what triggers the stride prefetcher. The code repeats the following steps 10000 times:

flush
training phase: access lines 0, 3, 6 and 9 once each
sleep for about 500 cycles
measure phase: time a single access to one line in the OS page
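For readers who want to reproduce the idea, the steps above can be sketched roughly as follows. This is not the attached test code, only a minimal sketch under assumptions: a 64-byte cache line, a 4 KiB page, GCC or Clang on x86-64, and hypothetical helper names (flush_page, train, spin_wait, time_load, run_trial). The wait is counted in TSC ticks, which only approximates core cycles.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h> /* _mm_clflush, _mm_mfence, _mm_lfence, _mm_pause, __rdtsc, __rdtscp */

#define LINE_SIZE      64u /* assumed cache line size in bytes */
#define LINES_PER_PAGE 64u /* assumed 4 KiB page */

/* Flush phase: evict every line of the page from the cache hierarchy. */
static void flush_page(const volatile uint8_t *page) {
    for (size_t i = 0; i < LINES_PER_PAGE; i++)
        _mm_clflush((const void *)(page + i * LINE_SIZE));
    _mm_mfence();
}

/* Training phase: touch lines 0, stride, 2*stride, ... from a single loop.
   Returns the index of the last line touched. */
static size_t train(const volatile uint8_t *page, size_t stride, size_t count) {
    size_t line = 0;
    for (size_t i = 0; i < count; i++) {
        line = i * stride;
        (void)page[line * LINE_SIZE]; /* volatile read, so the load is kept */
    }
    return line;
}

/* Wait roughly `ticks` TSC ticks so that any prefetch can complete. */
static void spin_wait(uint64_t ticks) {
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < ticks)
        _mm_pause();
}

/* Measure phase: time one load in TSC ticks. */
static uint64_t time_load(const volatile uint8_t *p) {
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    _mm_lfence();
    (void)*p;
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}

/* One iteration: flush, train, wait, then time a single probe line. */
static uint64_t run_trial(const volatile uint8_t *page, size_t stride,
                          size_t count, size_t probe_line) {
    flush_page(page);
    train(page, stride, count);
    spin_wait(500);
    return time_load(page + probe_line * LINE_SIZE);
}
```

With a page-aligned buffer, run_trial(page, 3, 4, 12) would train on lines 0, 3, 6, 9 and then time line 12, the next line in the stride. Whether the compiler keeps the training loop as a single load instruction is not guaranteed by this sketch and should be checked in the disassembly.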

However, I only see cache hits on lines 0, 3, 6 and 9. I cannot observe any stride prefetching activity, even after changing the stride or the length of the access pattern. So I wonder: is there no stride prefetcher in the Intel processor, or are there some special trigger conditions?

McCalpinJohn · ‎02-19-2021

This prefetcher is called the "IP prefetcher", which suggests that it operates based on the sequence of addresses accessed by a single load instruction in the executing code. So the first step is to verify that your implementation is accessing memory in a loop so that all accesses are associated with the same load instruction. The compiler will often unroll loops, which would spread your loads across multiple instruction pointers and most likely defeat the prefetcher for short sequences.
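One way to keep every access on the same load instruction, even if the compiler unrolls the calling loop, is to put the load in a function that is never inlined. The following is only a sketch under assumptions: it relies on the GCC/Clang `noinline` attribute, assumes 64-byte lines, and the names load_byte and strided_loads are made up for illustration. It is still worth confirming in the disassembly that only one load instruction is emitted.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u /* assumed cache line size in bytes */

/* The only load instruction used for the access pattern. Because the
   function is not inlined, every call executes the load at the same
   instruction address. */
__attribute__((noinline))
static uint8_t load_byte(const volatile uint8_t *p) {
    return *p;
}

/* Touch `count` lines spaced `stride_lines` apart, all through load_byte.
   Returns the sum of the bytes read so the loads have a visible result. */
static unsigned strided_loads(const volatile uint8_t *base,
                              size_t stride_lines, size_t count) {
    unsigned sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += load_byte(base + i * stride_lines * LINE_SIZE);
    return sum;
}
```

The call and return add a few instructions per access, so this form suits a training loop better than a timed measurement.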

I have never tested this prefetcher myself, but there are some good comments on the L1 prefetchers at https://stackoverflow.com/a/53553395

JasperMa · ‎02-23-2021

Thanks for your reply. I actually do the memory accesses in a loop, so only one instruction performs them. The Stack Overflow post does offer some new insight into the L1 prefetchers. However, after checking it and doing what it suggests, I still can't see any activity from either the L1 stride prefetcher or the L1 DCU prefetcher.

JasperMa · ‎02-23-2021

I have also organised my code and attached it, in case anyone interested in prefetching wants to run it on their machine. Just run

sudo ./run.sh

That is all that is needed. The result on my machine shows that the access time for line 12 is mostly above 180 cycles. I don't think the problem is in the time measurement code, because if I change the measured line from cache line 12 to cache line 6 (just change it in test.c, line 103), the access time is mostly 25 cycles.