Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

LLC misses in sequential memory accesses on Skylake

Smith__Joe
Beginner

Hi,
I have implemented a simple matrix-vector multiplication algorithm where the inner loop looks like this:

// Each input element is 4 consecutive floats; c0..c3 hold the matrix columns.
for(size_t i = 0; i < vecsize; i++){
    // Broadcast each scalar component into all four SSE lanes.
    __m128 x = _mm_load_ps1(source + 4*i + 0);
    __m128 y = _mm_load_ps1(source + 4*i + 1);
    __m128 z = _mm_load_ps1(source + 4*i + 2);
    __m128 w = _mm_load_ps1(source + 4*i + 3);

    // Accumulate the matrix-vector product with FMAs.
    __m128 res = _mm_mul_ps(x, c0);
    res = _mm_fmadd_ps(y, c1, res);
    res = _mm_fmadd_ps(z, c2, res);
    res = _mm_fmadd_ps(w, c3, res);

    // Non-temporal store: the result bypasses the cache hierarchy.
    _mm_stream_ps(target+i*4, res);
}

To me this seems like a perfectly sequential access pattern, so I would assume that the hardware prefetcher(s) have an easy time prefetching whichever cache lines are needed next.
However, running a test with 51,200,000 elements, "perf record -g -e LLC-load-misses:u" shows 1,705,033 LLC misses for the whole program, about 30% of which happen inside this loop.
Is there any good explanation for why the number of LLC load misses is non-zero for a loop like this?

McCalpinJohn
Honored Contributor III

The L2 hardware prefetchers operate on physical addresses, with no knowledge of the virtual memory page size in use. They therefore only operate within 4KiB regions (the default and smallest available page size), since they cannot know which 4KiB physical page will be accessed next.

On a Cascade Lake Xeon, the memory bandwidth is up to 128 GB/s per socket, so it takes 32 ns to transfer 4KiB from memory. Memory latency is about 78 ns. If the prefetcher requested the full 4KiB when it saw the first reference to the page, it would still take about 78 ns + 32 ns = 110 ns to load the page. In practice, the prefetcher is not so aggressive -- it waits until it sees two references to the same 4KiB page, then issues prefetches for two more cache lines following the same stride. If it sees a third request matching the sequence, it becomes more confident that the "stream" is real and fetches two more lines, and so on.

The details are not specifically documented, though there are some notes in the Intel Optimization Reference Manual (document 248966, especially section 2.5.5) that provide examples.  

If I recall correctly, in the IBM POWER architecture a memory request can include information about the virtual memory page size in use, allowing the hardware prefetchers to operate over larger regions, with the 64KiB page size added specifically to support more effective hardware prefetching.

Smith__Joe
Beginner

Thanks a lot for the answer! If the cache misses only occur at page boundaries, shouldn't using huge pages somewhat remedy this problem?
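If it helps: I would test this with something like the following Linux-specific sketch. The 2MiB alignment and the madvise() hint are just my guesses at how to request transparent huge pages, so corrections are welcome:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

/* Best-effort request for 2MiB transparent huge pages (Linux).
 * The kernel may ignore the hint; 2MiB alignment makes it more
 * likely that the buffer can actually be backed by huge pages. */
static float *alloc_huge(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 2u << 20, bytes) != 0)
        return NULL;
    madvise(p, bytes, MADV_HUGEPAGE);  /* hint only; may fail silently */
    return (float *)p;
}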
Also, as a follow-up: according to llvm-mca, the assembly generated for this loop:

.L3:
        vbroadcastss    xmm0, DWORD PTR [rax]
        vbroadcastss    xmm1, DWORD PTR [rax+4]
        vmulps  xmm0, xmm0, xmm4
        add     rax, 16
        add     rdx, 16
        vfmadd132ps     xmm1, xmm0, xmm3
        vbroadcastss    xmm0, DWORD PTR [rax-8]
        vfmadd231ps     xmm1, xmm0, xmm5
        vbroadcastss    xmm0, DWORD PTR [rax-4]
        vfmadd132ps     xmm0, xmm1, xmm2
        vmovntps        XMMWORD PTR [rdx-16], xmm0
        cmp     rcx, rax
        jne     .L3

has an RThroughput of 2.3; using __rdtsc(), however, I measure about 4.1 cycles per iteration. Is there something else I'm missing?
Please disregard -- this was just due to the cache misses / memory bandwidth. Thanks again for your help!

McCalpinJohn
Honored Contributor III

The L2 prefetchers in recent Intel processors stop at 4KiB boundaries no matter what page size is in use (see the bottom of page 2-46 in the Intel Optimization Reference Manual, document 248966-042b, September 2019). This could change in future processors, of course, so it is a good idea to keep this in mind.

Starting in Ivy Bridge, Intel processors have a "next-page prefetcher" in the core. Because this is in the core and not in the L2 cache, it is able to operate on virtual addresses. There is no real documentation on exactly what it does, but from my testing it is clear that it causes the page table entry for the next 4KiB page to be fetched early. This prevents an additional stall in the memory access sequence for the initial access to a 4KiB page (which can't start until the page's virtual-to-physical translation is available). It might load a cache line from the page into the L1D cache -- I have not tried to test for this explicitly. As you might expect, the next-page prefetcher does not seem to do much for contiguous accesses on 2MiB pages, but it is inconvenient to test because there is no documented way to disable this new prefetcher. (It does not appear to be difficult to "outsmart", but the impact on bandwidth/latency/cache-miss counts is very small, so I only studied it to understand the effect on page table walks.)

In (almost) all recent Intel processors, software prefetches will generate page table walks, enabling earlier access to the next 4KiB region. This should enable the L2 HW prefetchers to ramp up faster, but it would require careful experimentation to figure out how the two interact.
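For illustration only, here is a sketch of what adding such a software prefetch to the loop above might look like. The prefetch distance (one 4KiB page ahead, PF_DIST) is a guess that would need tuning, not a measured recommendation, and the caveat below applies directly to this kind of modification:

#include <stddef.h>
#include <immintrin.h>

/* PF_DIST is hypothetical: 256 elements * 16 bytes = one 4KiB page.
 * Prefetches past the end of the array do not fault, so no tail
 * check is needed for correctness. */
enum { PF_DIST = 256 };

void matvec(const float *source, float *target, size_t vecsize,
            __m128 c0, __m128 c1, __m128 c2, __m128 c3)
{
    for (size_t i = 0; i < vecsize; i++) {
        if ((i & 3) == 0)  /* one prefetch per 64-byte cache line */
            _mm_prefetch((const char *)(source + 4*(i + PF_DIST)),
                         _MM_HINT_T1);

        __m128 x = _mm_load_ps1(source + 4*i + 0);
        __m128 y = _mm_load_ps1(source + 4*i + 1);
        __m128 z = _mm_load_ps1(source + 4*i + 2);
        __m128 w = _mm_load_ps1(source + 4*i + 3);

        __m128 res = _mm_mul_ps(x, c0);
        res = _mm_fmadd_ps(y, c1, res);
        res = _mm_fmadd_ps(z, c2, res);
        res = _mm_fmadd_ps(w, c3, res);

        _mm_stream_ps(target + 4*i, res);
    }
}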

Important Caveat!

  • When using one core, bandwidth is limited by the available concurrency, so code modifications to increase concurrency are beneficial.
  • When using most or all cores, bandwidth is limited by the DRAM subsystem and code modifications to increase concurrency actually *decrease* performance (by making it harder for the memory controller to efficiently schedule the accesses to the available DRAM banks).
Smith__Joe
Beginner

Thank you for this additional information, although it makes me seriously question where those cache misses come from. After all, this is a Skylake CPU, so the "next-page prefetcher" should exist and have no problem at all recognizing my perfectly sequential access pattern.
If I change the assembly to load the same data again (replacing "add rax, 16" with "add rax, 0"), I get about 1.9 cycles per iteration, whereas the original version takes about 4.0 cycles per iteration, so there seems to be some performance problem related to loading the new data from memory...

McCalpinJohn
Honored Contributor III

The "next page prefetcher" appears to be primarily intended to enable the *page table entry* to be pre-loaded.  It might also request a single cache line to be brought into the L1 Data Cache, but this is less certain.  (It is also unclear when the prefetch is issued, but it may not be early enough to prevent the demand load miss.)

When (re-)starting the L2 HW prefetcher on a new 4KiB region, it is impossible to prefetch all 64 of the cache lines early enough to get the data into the L3 cache before the demand load misses the L2 cache.  The core generates the demand loads and throws them at the cache as fast as it can.  When moving into a new 4KiB region, the first of these loads will immediately miss in the L1 Data Cache, the L2 cache, and the L3 cache.  The number of L3 misses depends on how long it takes to service the initial L2 HW prefetches compared to the rate at which the core can push L1 Data Cache misses to the L2.  

When working with data set sizes that are larger than the caches, it is always a good idea to look at the cache miss counts scaled by the expected number of cache lines transferred. With tens of millions of elements in the array(s) and only 1 or 2 LLC misses per 4KiB page, the raw LLC miss count can look large even though it is close to the theoretical minimum.
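A quick back-of-the-envelope check with the numbers from the original post (assuming 16 bytes per element, i.e., the 4-float layout in the code above):

#include <stdio.h>

int main(void)
{
    /* Numbers from the original post; 16 bytes/element is assumed
     * from the 4-float layout of the source array. */
    long long elements = 51200000LL;
    long long bytes    = elements * 16;   /* 819,200,000 bytes read */
    long long pages    = bytes / 4096;    /* 200,000 4KiB pages */

    /* At 1-2 LLC misses per 4KiB page, the expected floor is: */
    printf("%lld to %lld LLC load misses\n", pages, 2 * pages);
    /* That is ~200,000-400,000, versus roughly 500,000 observed
     * inside the loop (30% of 1,705,033): the same order of
     * magnitude as the theoretical minimum. */
    return 0;
}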
