Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

hardware prefetching programmatically (posted again)

New Contributor I

We know how to perform software prefetching using C++ intrinsics. (Sorry for this duplicate post but it was erroneously tagged as spam.)

As for hardware prefetching to improve cache performance, Chapter 8 of the Intel 64 and IA-32 Architectures Optimization Reference Manual shows how to do that using inline asm.

Following that suggestion, we could use the following inline assembly to hw prefetch 512 bytes with a 64-byte cache line (where, alternatively, mov r9 can be replaced by vmovaps ymm0):


inline void Prefetch512_HW(void* address)
{
    _asm {
        mov rsi, address
        mov r9, [rsi]
        mov r9, [rsi+64]
        mov r9, [rsi+128]
        mov r9, [rsi+192]
        mov r9, [rsi+256]
        mov r9, [rsi+320]
        mov r9, [rsi+384]
        mov r9, [rsi+448]
    }
}
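As a side note: as far as I know, 64-bit MSVC does not support `_asm` blocks at all, so on x64 a portable sketch like the following may be needed. It is only an assumption of mine that touching one byte per cache line is the goal, and the function name is made up:

```cpp
#include <cstdint>

// Touch one byte in each of the eight 64-byte cache lines covering
// 512 bytes. The volatile reads stand in for the "mov r9, [rsi+nnn]"
// sequence: their results are discarded, so the pipeline does not
// have to stall waiting for the data to arrive.
inline void Prefetch512_Portable(const void* address)
{
    const volatile uint8_t* p =
        static_cast<const volatile uint8_t*>(address);
    for (int line = 0; line < 512; line += 64)
        (void)p[line];  // load and discard, one read per cache line
}
```

The volatile qualifier keeps the optimizer from deleting the otherwise "useless" loads, but checking the generated asm is still advisable.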


  1. Is there a lighter (faster) way to attain the same goal? For example, loading a single byte with each mov instruction, or using some alternative instruction to mov for those addresses?
  2. Suppose that we have a large number of 4KiB virtual pages and want to improve TLB caching (page walks) of those addresses: how can we do that programmatically?

Thank you


7 Replies
Honored Contributor III

The intent of the suggestion is to show just one example of the CPU's ability to hardware pre-fetch. In that example there are two concepts to learn from (I did not read the manual, so I cannot say whether it addresses this and you overlooked it):

1) Note that the value(s) of r9, being fetched from RAM, LLC, L2, or L1, are not immediately used as input. This permits the instruction pipeline to continue execution without stalling for the results from the source. Not stalling, in turn, permits the next instruction to queue up an additional read, and the next instruction another read, and so on. The data reads, while not immediately used, will migrate from wherever they currently reside into the L1 cache.

2) In this specific example, note that the sequence of addresses being referenced consists of cache-line-sequential locations. In CPU-speak, this is a data stream. What may or may not have been addressed in the paragraph you referenced, but is most likely stated nearby in the Architectures Optimization Reference Manual, is that the CPU is capable of detecting code referencing multiple (1 to 8, depending on CPU) such data stream access patterns, with or without a stride, and with or without your code immediately using the fetched data. In that case, it will continue to pre-fetch in advance of your code.

A difference between the CPU implicitly detecting and executing the pre-fetch stream and your code explicitly executing something like Prefetch512_HW is that your software may generate a page fault for an address that is not mapped (or has access restrictions), while the CPU's implicit fetch ignores the attempted access fault. It is your programming responsibility to ensure (in this example) that [rsi+nnn] is a valid address.

Jim Dempsey


New Contributor I

Thanks, the suggestion goes in that direction: a block of these reads is performed before an actual memcpy that uses streaming stores.

The sequence of addresses is indeed a data stream for a tuned memcpy, where each [rsi+nnn] is a valid address.

My points are the two questions:

  1. Can I make it lighter in terms of computation overhead? For example, since I am also using suitable sw prefetching, I discovered through vtune that I can improve performance by reducing the mov instructions and using BYTE PTR [rsi+nnn].
  2. I would also like to get a benefit for the TLB, as otherwise I could just use sw prefetching. I hope that hw prefetching can help with page walks when suitably triggered, but I cannot find any documentation on how to do this programmatically; I am just guessing.


Honored Contributor III

>>Can I make it lighter in terms of computation overhead?

The "overhead" of the instruction sequence (a series of moves to a register that is not used) will be on the order of register-to-register moves. The CPU will not stall for the requested data until the receiving register (r9 in this example) is used as a source register, and in this case it is not. There is a limited number (CPU dependent) of in-flight reads permitted. Immediately following this sequence, it is presumed that the code will re-read the same locations in the same sequence (and continue reading the stream) while actually using (referencing) the data read.

What the code sequence is designed to do is to trade-off 8 or so clock cycles against the reduction of fetch latency during the initial time it takes for the HW pre-fetcher to detect that you are accessing memory in a stream.

Without this "hack", say for a memmove, you would (possibly) have a move from memory to register, then a move from register to memory. Or a move from memory to register, use of the register in an expression, then a store of the result to memory (a separate stream, by the way). IOW, the front part of your loop would contain (possibly) 8 memory stalls before the HW prefetcher recognized the stream. With the "hack" you have preconditioned the HW pre-fetcher to recognize the input stream. Whether this saves sufficient time for a large block memmove is arguable; however, when used in a read, expression, write sequence, it may be more beneficial.

Note, using _mm_prefetch, you would likely have the prefetch-ahead preceding every cache line in the move loop (as opposed to the 8 memory touches preceding the loop).
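For illustration, a minimal sketch of such a loop with the prefetch-ahead inside it (the function name and the 4-line prefetch distance are my own choices, not tuned values):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstring>

// Copy 'bytes' (assumed to be a multiple of 64) one cache line at a
// time, issuing a software prefetch a few lines ahead of the current
// read position.
void CopyWithPrefetch(void* dst, const void* src, size_t bytes)
{
    constexpr size_t PF_DISTANCE = 4 * 64;  // prefetch 4 lines ahead
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (size_t i = 0; i < bytes; i += 64) {
        if (i + PF_DISTANCE < bytes)
            _mm_prefetch(s + i + PF_DISTANCE, _MM_HINT_T0);
        std::memcpy(d + i, s + i, 64);  // move one cache line
    }
}
```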

>> I would like also to get a benefit for the TLB

Yes. As far as I know, the PREFETCH instruction may be insufficient to update a TLB, and it definitely will not update across a page table update request. Therefore, it may be advisable to induce the prefetch via a move to a register that is not used (at least until some time later). This can be tricky, not knowing the register assignments used by the compiler, and given the compiler's optimization efforts to remove "useless" code. You might experiment targeting "volatile register int prefetchRegister;", then "prefetchRegister = array[cachLineStridedIndex];". Failing that, define a MACRO that expands to an equivalent asm statement.
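A sketch of that idea, touching the first cache line of each 4KiB page through a volatile sink (the names are made up, and whether the compiler actually preserves the loads should be verified in the generated asm):

```cpp
#include <cstdint>
#include <cstddef>

// Volatile sink: assignments to it cannot be removed as dead code.
static volatile uint64_t g_preloadSink;

// Read the first 8 bytes of every 4KiB page in [base, base+bytes) so
// that any TLB miss (and page walk) happens here, ahead of the code
// that actually consumes the data.
inline void PreloadPages(const void* base, size_t bytes)
{
    const uint8_t* p = static_cast<const uint8_t*>(base);
    for (size_t off = 0; off < bytes; off += 4096)
        g_preloadSink = *reinterpret_cast<const uint64_t*>(p + off);
}
```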

Jim Dempsey


Hi Roberto,

Thanks for reaching out to us. We are forwarding your issue to the concerned compiler experts. We will get back to you at the earliest.

Warm Regards,


Honored Contributor III

It is important to be careful with terminology, or one can easily get confused.

What you are doing in this example is *not* HW "prefetching", it is "pre-loading".    Hardware prefetching is a completely autonomous and invisible system that you cannot control or (directly) monitor.  Hardware prefetching in Intel processors is extremely aggressive, so it is seldom possible to make performance significantly better -- even with heroic efforts and a good understanding of the hardware.

There are four different concepts here -- three "prefetching" plus "pre-loading":

  1. Hardware prefetch looks at the history of addresses being loaded (or stored), computes strides, computes future addresses in sequence along those strides, and automatically generates fetches to load those addresses into some level(s) of the caches.  
    • In current Intel processors, hardware prefetching only operates on physical addresses and only within naturally-aligned 4KiB memory ranges.
    • Hardware prefetching (at the L2 level) can generate more concurrent cache line transfers than the core (roughly 24 total L2 misses generated by the L2 HW PF engines vs 12 L1 misses generated by a core).
  2. Software prefetch allows the user to execute an instruction that will generate a fetch for an address into some level(s) of the caches.  
    • This allows more precise control, and can be useful when the addresses being accessed are not contiguous and not grouped into a small number of 4KiB pages.  (Small means no more than 32 streams -- last documented for Sandy Bridge -- maybe unchanged, maybe not?)
    • For almost all recent Intel processors, Software prefetch instructions will cause Page Table Walks if the address is not found in the L1 or L2 TLB.  
    • One limitation is that SW prefetch instructions compete for the same set of 12 L1 Miss Buffers that are used for demand loads.  (Multiple references to the same line will merge, but SW prefetches don't allow you to get more concurrent cache lines moving.)
  3. Starting in Ivy Bridge, Intel processor cores support a "Next Page Prefetcher".  This is a nearly completely undocumented prefetcher located in the core/L1 that appears to fetch one line from the next 4KiB page using virtual addresses.  Fetching one cache line from the next page does two important things:
    • The Next-Page-Prefetch occurs early enough that there are almost never TLB misses when loading contiguous data.
    • The Next-Page-Prefetch (appears to) "prime" the L2 hardware prefetchers by giving them an access to a new 4KiB page to watch.  The L2 HW prefetchers only need two accesses to a page to start prefetching, so after the Next-Page-Prefetch, the first L1 Miss to arrive at the L2 will start the generation of HW prefetches.
  4. "Pre-loading" is executing load instructions early for addresses that will be needed later.  This seems like a prefetch, but has completely different performance implications.
    • In an out-of-order processor, instructions can be executed out-of-order, but must be retired in program order.  (Software prefetches are typically exempt from this requirement because they are not allowed to change the semantics of a program.)  Pre-loading can certainly be done early, but as soon as the re-order buffer fills, the core must stall until the oldest instruction in the reorder buffer has completed, so that it can be retired first.
    • Reorder buffers are getting bigger, but this is not enough to tolerate memory latency.   (L1 hit and L2 hit latencies are typically completely overlappable, and L3 hit latencies are typically mostly overlapped, but out-of-order processing is seldom capable of hiding more than a small fraction of latency for main memory accesses.)


Honored Contributor III


Thanks for the detailed reply.

FWIW, from my experience with the last 10 generations or so of Intel CPUs, it has been extremely difficult to observe a performance improvement using software prefetches. In fact, experience shows it can be counter-productive; the HW prefetching is that good. Software pre-loading, on the other hand, is beneficial when done correctly.

Jim Dempsey

New Contributor I

Jim and John,

thank you, I read your inspiring comments with much interest. To give context to what I wrote: I am implementing a callback that fills a driver’s buffer using streaming stores, so as to leave L2 clean for the driver.

My baseline is to run the sequence of callbacks on a large data stream with an idle driver, and then compare the performance (with vtune) when the callbacks are actually interspersed with the active driver’s instructions. In the latter case, I would like to achieve performance similar to the baseline.

No intention to beat the hw prefetching in a sequential scan, I have no hopes... But when a callback is executed, I am hopefully “encouraging” the hw prefetcher to follow the callback’s sequential pattern (which could otherwise have been implicitly altered by the driver’s instructions).

Most of the computation in the baseline is memory bound, as one would expect. Prefetching helps a lot with interspersed calls (I am issuing _mm_prefetch on L3 and then NTA ahead of time). Actually, I noted in the vtune microarchitecture exploration that the TLB does not create issues, even though the data stream is quite large and thus many pages are virtual alloc’d (checked with 2Gb). So this is good, probably due to the “Next Page Prefetcher”, which I did not know about before.

I am aware that I wrongly called hardware prefetching what is actually pre-loading, but it seems to do its job with interspersed calls. Clearly it does not improve performance in the baseline (actually the number of loads increases), but it gives more stability in the streaming stores when callbacks are intermixed with the driver’s instructions.

For the memmove, I am using a macro, as I have better control over the generated code. Declaring a volatile var is an interesting option that I had not thought of, great!

I measured the benefit of NTA prefetching with the streaming stores as the callbacks are intermixed. With regular stores I do not know...


