Intel hardware Prefetcher
Intel's website shows that there are four kinds of hardware prefetchers. The prefetcher controlled by bit 3 is the L1 stride prefetcher. I wrote a test program to find out what triggers the stride prefetcher. It repeats the following steps 10000 times:
- training phase: access cache lines 0, 3, 6, and 9 once each
- wait for roughly 500 cycles
- measure phase: time one access to a single line in the OS page
However, only lines 0, 3, 6, and 9 hit in the cache. I observe no stride-prefetching activity, even after changing the stride or the length of the access pattern. So I wonder: is there no stride prefetcher in the Intel processor, or does it have some special trigger condition?
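For anyone who wants to reproduce this, one round of the procedure above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the exact test program: the helper names (`flush_page`, `train`, `wait_cycles`, `probe_line`, `run_round`), the 64-byte line size, the 4 KiB page size, and the explicit flush at the start of each round are all assumptions. It also assumes an x86 compiler that provides `<x86intrin.h>`. Note that `__rdtscp` counts TSC reference ticks, which need not equal core clock cycles.

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, _mm_lfence, __rdtscp */

#define LINE_SIZE 64      /* assumed cache-line size in bytes */
#define PAGE_SIZE 4096    /* assumed OS page size in bytes */
#define LINES_PER_PAGE (PAGE_SIZE / LINE_SIZE)

/* Evict every line of the page so each round starts cold (assumed step). */
static void flush_page(volatile uint8_t *page) {
    for (int i = 0; i < LINES_PER_PAGE; i++)
        _mm_clflush((const void *)(page + i * LINE_SIZE));
    _mm_mfence();
}

/* Training phase: touch `count` lines starting at line 0, `stride` lines
 * apart. The volatile read keeps the load from being optimized away. */
static void train(volatile uint8_t *page, int stride, int count) {
    for (int i = 0; i < count; i++)
        (void)page[i * stride * LINE_SIZE];
}

/* Busy-wait until roughly `ticks` TSC ticks have elapsed. */
static void wait_cycles(uint64_t ticks) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    while (__rdtscp(&aux) - start < ticks)
        ;
}

/* Measure phase: time a single load of `line`, in TSC ticks. */
static uint64_t probe_line(volatile uint8_t *page, int line) {
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    (void)page[line * LINE_SIZE];
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}

/* One round: flush, train on lines 0, 3, 6, 9, wait, then time `line`. */
static uint64_t run_round(volatile uint8_t *page, int line) {
    flush_page(page);
    train(page, 3, 4);
    wait_cycles(500);
    return probe_line(page, line);
}
```

Calling `run_round(page, 12)` 10000 times on a page-aligned 4096-byte buffer and collecting the returned values would correspond to the experiment described above.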
This prefetcher is called the "IP prefetcher", which suggests that it operates based on the sequence of addresses accessed by a single load instruction in the executing code. So the first step is to verify that your implementation is accessing memory in a loop so that all accesses are associated with the same load instruction. The compiler will often unroll loops, which would spread your loads across multiple instruction pointers and most likely defeat the prefetcher for short sequences.
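As a rough illustration of what "the same load instruction" means in practice, the training accesses could be issued from a small non-inlined helper like the one below. The function name `strided_loads` and the 64-byte line size are assumptions for this sketch, and the `noinline` attribute is GCC/Clang syntax. Neither `noinline` nor the `volatile` read guarantees that the compiler leaves the loop rolled, so it is still worth checking the disassembly for a single load in the loop body.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64  /* assumed cache-line size in bytes */

/* Hypothetical helper: issue `count` loads, `stride` lines apart, from one
 * loop body. noinline keeps the loop from being merged into its callers;
 * the volatile read keeps each load from being optimized away. The compiler
 * may still unroll the loop, so inspect the generated code. */
__attribute__((noinline))
static uint64_t strided_loads(const volatile uint8_t *base,
                              size_t stride, size_t count) {
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += base[i * stride * LINE_SIZE];  /* the load of interest */
    return sum;
}
```

With this helper, `strided_loads(page, 3, 4)` would touch lines 0, 3, 6, and 9 of `page`, and the returned sum only exists to give the loads a visible use.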
I have never tested this prefetcher myself, but there are some good comments on the L1 prefetchers at https://stackoverflow.com/a/53553395
Thanks for your reply. I do perform the memory accesses in a loop, so only one instruction performs them. The Stack Overflow post does offer some new insight into the L1 prefetchers. However, after following the suggestions there, I still can't see any activity from either the L1 stride prefetcher or the L1 DCU prefetcher.
I have also cleaned up my code and attached it, in case anyone interested in prefetching wants to run it on their machine. Just run
is enough. On my machine, the access time for line 12 is mostly above 180 cycles. I don't think the timing code is at fault, because if I change the measured line from cache line 12 to cache line 6 (just change it at line 103 of test.c), the access time is mostly 25 cycles.
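To make "mostly" more concrete, the 10000 timings could be summarized as a hit fraction. The sketch below is not part of the attached code: `hit_fraction` is a hypothetical helper, and the threshold separating hits from misses is something to choose per machine (a value between the roughly 25 and 180 cycles seen here would be a plausible starting point).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fraction of samples at or below `threshold` ticks,
 * i.e. accesses that were probably cache hits. Returns 0.0 for n == 0. */
static double hit_fraction(const uint64_t *samples, size_t n,
                           uint64_t threshold) {
    if (n == 0)
        return 0.0;
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (samples[i] <= threshold)
            hits++;
    return (double)hits / (double)n;
}
```

A hit fraction near 0 for line 12 and near 1 for line 6 would match the behavior described above.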