I don't see a section number like that in a recent version of the manual. You appear to have nearly the same information I see in a current one, which says the latest model supports one forward and one backward L2 hardware prefetch stream per 4 KiB page, while the L1D prefetcher supports only one forward stream per page. The AVX/instruction-set forum would probably be the more topical place for this question, depending on which model you mean.
From my limited working experience, it appears that a software prefetch into the L1D cache is only useful when the TLB entry for the target address is not already cached.
Before a memory location can be read (or written), the levels of the virtual-memory page tables must be resolved through the TLB caching system (which is separate from the LnD and LnI caches). I've seen performance degrade when issuing a prefetch for an address whose TLB entry is already loaded. The TLB caches only a small portion of the page tables.
IOW, the prefetch seems to be effective mainly for priming the TLB cache.
The reason for terminating these hardware prefetchers at the end of the page is to avoid page-miss side effects. If you wish to initiate a prefetch into the Xeon's TLB cache, so as to get an early start on resolving a possible DTLB miss, you would need a software prefetch, either explicit in the source code or generated by a compiler option. I'm not reading this as part of the original question.

The question does raise interesting consequences. If an array has been modified by a different core, the prefetcher would accelerate getting it back to a core that needs the update. Only one forward-going array stream per page could be accelerated all the way by hardware prefetch, and an additional backward-going stream could be accelerated only into L2.
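As a sketch of the explicit-software-prefetch idea above (not the only way to write it), here is what touching one line per 4 KiB page ahead of the main loop might look like in C with the SSE `_mm_prefetch` intrinsic. The loop body, the prefetch distance, and the function name are all my own illustrative choices and would need tuning on real hardware:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

#define PAGE_SIZE 4096

/* Sum a large byte array, issuing one software prefetch per 4 KiB page
 * a fixed distance ahead.  The intent is to start the DTLB walk early,
 * since the hardware prefetchers stop at page boundaries. */
static uint64_t sum_with_prefetch(const uint8_t *buf, size_t len)
{
    const size_t lead = 2 * PAGE_SIZE;  /* prefetch distance: an assumption, tune empirically */
    uint64_t total = 0;

    for (size_t i = 0; i < len; i++) {
        /* One prefetch at the start of each page inside the lead window. */
        if ((i % PAGE_SIZE) == 0 && i + lead < len)
            _mm_prefetch((const char *)&buf[i + lead], _MM_HINT_T0);
        total += buf[i];
    }
    return total;
}
```

On GCC/Clang, `__builtin_prefetch` is the portable equivalent for non-x86 targets. Per the point above, this should mainly pay off when the DTLB entry for the upcoming page is not already resident; when it is, the extra prefetch instructions can just as easily cost a little.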