Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

prefetching in L1D cache

Dear All,

The Intel Manual (Intel 64 and IA-32 Architectures Optimization Reference Manual)
says in Section that the L2 cache can only track 1 memory stream per 4K memory page.

Is this also true for the L1D cache, or can the L1D cache prefetch more than one
memory stream from the same page?


Black Belt
I don't see a section with that number in a recent version of the manual. The current version gives nearly the same information you quote: the latest model supports 1 forward and 1 backward L2 prefetch stream per 4K page, while L1D supports only 1 forward stream per page.
Depending on which model you mean, the question might be more on-topic in the AVX/instruction set forum.
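To make the per-page limit concrete, here is a minimal C sketch for checking whether two streams start in the same 4K page, and so would compete for the single tracked stream described above. The names `same_4k_page`, `round_up_to_page` and `PAGE_BYTES` are made up for illustration, and the sketch assumes plain 4 KiB pages (not large pages).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the "per 4K page" wording in the manual. */
#define PAGE_BYTES ((uintptr_t)4096)

/* Returns 1 if both addresses lie in the same 4 KiB page, 0 otherwise.
 * The page number is the address with the low 12 bits dropped. */
static int same_4k_page(const void *a, const void *b)
{
    return ((uintptr_t)a / PAGE_BYTES) == ((uintptr_t)b / PAGE_BYTES);
}

/* Rounds a byte count up to a whole number of pages, e.g. to size
 * allocations so that each array starts in a page of its own. */
static size_t round_up_to_page(size_t bytes)
{
    return (bytes + (size_t)PAGE_BYTES - 1) / (size_t)PAGE_BYTES * (size_t)PAGE_BYTES;
}
```

If two arrays that are read in the same loop turn out to share a page, padding each allocation to a page multiple (and aligning it to 4096 bytes) is one way to give every stream its own page. Whether that helps on a given model would have to be measured.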
Black Belt
From my limited working experience, it would appear that prefetching into the L1D cache is only useful when the TLB entry for the target address is not already cached.

In order for a memory location to be read (or written), the translation for its address, taken from the two levels of the virtual memory page tables, must reside in the TLB caching system (which is separate from the LnD and LnI caches). The TLB caches a small portion of the page tables. I've experienced performance degradation from prefetching when the TLB entry for that address is already loaded.

IOW, the prefetch seems to be effective only for loading the TLB.

On different architectures this may not hold.

Jim Dempsey
Black Belt
The reason for terminating these hardware prefetchers at the end of the page is to avoid page miss side effects. If you wish to initiate a prefetch to Xeon TLB cache, so as to get started early on resolution of a possible DTLB miss, you would require a software prefetch, either explicit in source code, or generated by a compiler option. I'm not reading this as part of the original question.
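As a rough illustration of an explicit software prefetch in source code, the C sketch below touches an address exactly one 4K page ahead each time the loop has advanced by a page's worth of elements. It uses the GCC/Clang builtin `__builtin_prefetch`; with Intel compilers the `_mm_prefetch` intrinsic from `<xmmintrin.h>` would be the usual alternative. The function name `sum_prefetch_next_page` and the one-page distance are assumptions made for the example, not a tuned recommendation.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages, so one page holds 512 doubles. */
#define PAGE_BYTES 4096u
#define DOUBLES_PER_PAGE (PAGE_BYTES / sizeof(double))

/* Sums n doubles. Once every DOUBLES_PER_PAGE elements it issues a
 * software prefetch for the element 4096 bytes ahead, which always
 * lies in the page after the one currently being read. */
double sum_prefetch_next_page(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i % DOUBLES_PER_PAGE == 0 && i + DOUBLES_PER_PAGE < n) {
            /* Arguments: address, 0 = prefetch for read, 3 = high temporal locality. */
            __builtin_prefetch(&a[i + DOUBLES_PER_PAGE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result. Whether it actually starts the address translation early, and whether that is a net gain, depends on the processor model and needs to be confirmed by timing.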
The question does raise interesting consequences. If an array has been modified by a different core, the hardware prefetcher would speed up getting it back to a core which needs the update. Only 1 forward-going array in the page could be accelerated this way, and an additional backward-going array could be accelerated only into L2.