Intel® ISA Extensions

Do Non-Temporal Loads Prefetch?

Nicholas_B_1
Beginner

I can't find any information on this anywhere. Do non-temporal load instructions (e.g. MOVNTDQA), which use a separate non-temporal buffer rather than the cache hierarchy, do any prefetching? How do the latency and bandwidth compare to a normal load from main memory?

Is the right way to think about the buffer that it is as "close" to main memory as the L3 cache, but also as "close" to the register file as the L1 cache?

TimP
Honored Contributor III

There's a discussion of this load instruction in the Intel Architecture Software Developer's Manual, including a description of three ways it may be implemented. If it is implemented identically to MOVDQA, then evidently it will fetch into the cache, but it seems you would not use the instruction unless you hoped for it to behave differently on a platform of interest to you.

Likewise, you can read there about non-temporal stores. Plenty of contradictory things have been said about them, but a non-temporal store appears to be a one-way "blind" write that bypasses the cache hierarchy and evicts the line from all cache levels if it is present.
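
To make that concrete, here is a minimal sketch of a non-temporal store loop using SSE2 intrinsics; the function name and arguments are hypothetical, and the point is only to show the shape of the idiom:

    #include <emmintrin.h>  /* SSE2: MOVNTDQ / _mm_stream_si128 */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill a 16-Byte-aligned buffer with non-temporal stores.
     * MOVNTDQ goes out through write-combining buffers, bypassing
     * the caches, so the destination lines are not cached. */
    void nt_fill(int32_t *dst, size_t n /* multiple of 4 */)
    {
        __m128i v = _mm_set1_epi32(0);
        for (size_t i = 0; i < n; i += 4)
            _mm_stream_si128((__m128i *)&dst[i], v);
        _mm_sfence();  /* drain the WC buffers so the stores become globally visible */
    }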

Nicholas_B_1
Beginner

I read all of the Intel manuals (Architecture Manual and Optimization Manual) on the topic several times before posting here, and the very few external resources on it (especially http://lwn.net/Articles/255364/). They are very vague about the non-temporal buffer, except that it holds one or more cache lines. I'd like to know what the memory latency and bandwidth are for loading, rather than writing, through this path. I tried Intel Memory Latency Checker v2, but it seems to measure only non-temporal writes, not non-temporal reads.

The documentation suggests that streaming data benefits from going through this channel rather than through the caches. Is that solely because it reduces the swapping of lines into and out of the cache, or because it provides a "shorter" path from main memory to the processor?

As to the different possible implementations: do different processor models list something in their specifications that would indicate which one they use, or is this, too, shrouded in mystery?

TimP
Honored Contributor III

Streaming stores eliminate the "read for ownership" (the read that initializes each newly written cache line with the current contents of memory). The corresponding reduction in memory traffic is sometimes a reasonable estimate of the benefit. You've probably noticed that Intel compilers have a streaming-stores "auto" mode, where the compiler decides whether to use streaming stores based on whether it sees read accesses to the same data, as well as "always" (use streaming stores whenever possible) and "never" options.
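
As an illustration, with the Intel compiler the choice can also be made per loop with a pragma. This is a minimal sketch; the function and loop are hypothetical, and if I recall correctly the command-line counterpart is -qopt-streaming-stores=always|never|auto:

    /* Intel C/C++ compiler: the pragma asks for streaming
     * (non-temporal) stores in the following loop, so the written
     * lines skip the read-for-ownership and are not cached. */
    void zero_array(double *a, int n)
    {
    #pragma vector nontemporal
        for (int i = 0; i < n; i++)
            a[i] = 0.0;
    }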

The documentation on non-temporal loads indicates that the non-temporal hint may simply have no effect on a WB memory system, so if you can determine that characteristic of the platform, you have a clue. Maybe there is coverage of it in device-driver programming guides for specific platforms that might make use of them, but those tend to be available only under non-disclosure agreements.

Nicholas_B_1
Beginner

Ah, hmm... So is it generally only a feature of particular specialty platforms? I could swear that somewhere in there it said an implementation could do the same thing with WB memory and treat it as such... but that's part of the confusion.

I'm somewhat of a novice, actually, and using MSVC. So I don't know what I don't know here.

McCalpinJohn
Honored Contributor III

The MOVNTDQA instruction is intended primarily for reading from address ranges that are mapped as "Write-Combining" (WC), not for reading from normal system memory that is mapped "Write-Back" (WB). The description in Volume 2 of the Intel Software Developer's Manual (SDM) says that an implementation "may" do something special with MOVNTDQA for WB regions, but the emphasis is on the behavior for the WC memory type.

The "Write-Combining" memory type is almost never used for "real" memory --- it is used almost exclusively for Memory-Mapped IO regions. 

Loads from address ranges marked as "Write Combining" cannot be cached. This is not quite as bad as loads from address ranges marked as "Strong Uncacheable" (UC); see Section 11.3, Table 11-2, and Table 11-7 of Volume 3 of the SDM. Loads from UC addresses cannot be cached, cannot be combined, and cannot be speculative. In rough terms, when the processor sees a load to an address of type UC, it waits until all prior memory operations have completed, issues the load (using a partial-cache-line read of exactly the size requested by the instruction), waits until the load has completed, and only then begins issuing later (in program order) memory operations. The latency for a read to a memory-mapped address in a PCIe device can easily be 300 ns, so the processor will only be able to read an 8-Byte quantity about once every 1000 cycles. (Loads larger than 64 bits are strongly discouraged for memory-mapped IO regions, with the exception of the MOVNTDQA instruction.)

Table 11-2 says that the WC memory type does allow speculative reads. (This means that you cannot use the WC memory type for memory-mapped IO ranges that have side effects on reads, since you can't guarantee that an address will not be read unexpectedly.) The good news is that the absence of side effects means that the processor can "speculatively" read a full 64-Byte block from the device and save it in a buffer to service multiple loads from that 64-Byte range. The description of MOVNTDQA in Volume 2 of the SDM makes it clear that Intel is not guaranteeing much here, but if you are careful you should be able to get a throughput of at least 64 Bytes per round-trip latency, a factor-of-8 improvement over what you might expect with 8-Byte uncached loads. An implementation may support more than one 64-Byte streaming load buffer, which would provide additional increases in throughput.
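
As a rough sketch of what this looks like in practice, the four 16-Byte streaming loads below should be serviced from a single streaming-load-buffer fill; the function and names are hypothetical, and the code assumes a WC-mapped, 16-Byte-aligned source pointer:

    #include <smmintrin.h>  /* SSE4.1: MOVNTDQA / _mm_stream_load_si128 */

    /* Read one 64-Byte line from a WC-mapped region.  If the
     * implementation fills a streaming-load buffer on the first
     * load, reading the whole line costs roughly one device
     * round trip instead of four. */
    void read_wc_line(const __m128i *wc_src, __m128i out[4])
    {
        out[0] = _mm_stream_load_si128((__m128i *)&wc_src[0]);
        out[1] = _mm_stream_load_si128((__m128i *)&wc_src[1]);
        out[2] = _mm_stream_load_si128((__m128i *)&wc_src[2]);
        out[3] = _mm_stream_load_si128((__m128i *)&wc_src[3]);
    }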

I have not seen evidence that the MOVNTDQA instruction does anything different from a normal MOVDQA on WB memory. It would be relatively easy to set up a few directed tests to see if the MOVNTDQA instruction helps avoid displacing data from the L1 or L2 caches. An example would be to read a large array from memory and add the values to the elements of a smaller array that you want to keep in the cache. With normal loads you would expect a significant degree of displacement if the smaller array is larger than half of the cache size. If switching to MOVNTDQA significantly reduces the miss rate, then you would have evidence that the "non-temporal" hint is being used. Figuring out exactly how it is being used is much more challenging, but a successful test would open up an interesting area for experimentation.
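
In case it helps, here is a minimal sketch of that test in C with SSE4.1 intrinsics. The sizes, names, and the assumption of a 256 KiB L2 are all hypothetical, both arrays must be 16-Byte aligned, and the miss rates would be read with hardware performance counters (not shown):

    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */
    #include <stddef.h>
    #include <stdint.h>

    #define SMALL (40 * 1024)        /* int32 elements: 160 KiB, more than half of a 256 KiB L2 */
    #define BIG   (64 * 1024 * 1024) /* int32 elements: streamed once from DRAM */

    /* Stream the big array through MOVNTDQA loads while repeatedly
     * updating the small array; compare L2 miss rates against a
     * version that uses ordinary MOVDQA loads instead. */
    void test(int32_t *small, const int32_t *big)
    {
        for (size_t i = 0; i < BIG; i += 4) {
            __m128i v = _mm_stream_load_si128((__m128i *)&big[i]);
            __m128i s = _mm_load_si128((__m128i *)&small[i % SMALL]);
            _mm_store_si128((__m128i *)&small[i % SMALL],
                            _mm_add_epi32(s, v));
        }
    }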

Nicholas_B_1
Beginner

Thank you, John, that was really helpful. I was thinking primarily of the non-cache-polluting aspect of the instruction, and was curious if I could be sacrificing anything performance-wise. I see I'll have to do some experimenting!

Daniel_L_1
Beginner
McCalpinJohn
Honored Contributor III

Thanks for posting the links to these results....

The results for the mainstream processors are not surprising: in the absence of true "scratchpad" memory, it is not clear that it is possible to design an implementation of "non-temporal" behavior that is not subject to nasty surprises. Two approaches that have been used in the past are (1) loading the cache line, but marking it LRU instead of MRU, and (2) loading the cache line into one specific "way" of the set-associative cache. In either case it is relatively easy to generate situations in which the cache drops the data before the processor has finished reading it.

Both of these approaches risk performance degradation in code that operates on more than a small number of arrays, and they become much harder to implement without "gotchas" when HyperThreading is considered.

In other contexts I have argued for the implementation of "load multiple" instructions that would guarantee that the entire contents of a cache line are copied to registers atomically. My reasoning is that the hardware absolutely guarantees that the cache line is moved atomically, and that the time required to copy the remainder of the cache line to registers is so small (an extra 1-3 cycles, depending on the processor generation) that it could safely be implemented as an atomic operation.

Starting with Haswell, the core can read 64 Bytes in a single cycle (two 256-bit aligned AVX reads), so the exposure to unintended side effects becomes even lower.

Starting with KNL, full-cache-line (aligned) loads should be "naturally" atomic, since the transfers from the L1 Data Cache to the core are full cache lines and all of the data is placed into the target AVX-512 register. (This does not mean that Intel guarantees atomicity in the implementation! We don't have visibility into the horrible corner cases that the designers have to account for, but it is reasonable to conclude that *most of the time* aligned 512-bit loads will occur atomically.) With this "natural" 64-Byte atomicity, some of the tricks used in the past for reducing cache pollution due to "non-temporal" loads may deserve another look....
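
For concreteness, a full-cache-line load into a single register with AVX-512 intrinsics might look like the sketch below; as noted above, the atomicity of this load is *not* architecturally guaranteed, and the function name is hypothetical:

    #include <immintrin.h>  /* AVX-512F */

    /* Load an entire 64-Byte cache line into one ZMM register.
     * The pointer must be 64-Byte aligned; on KNL the L1-to-core
     * transfer is a full line, so the load is "naturally" atomic
     * in practice, though Intel does not guarantee it. */
    __m512i load_line(const void *p)
    {
        return _mm512_load_si512(p);
    }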

JWong19
Beginner

To my understanding, software prefetch with the NTA hint loads data into only a single way of the L3 cache (I forget the exact term in Intel's manual). As you expect, this limits cache pollution. A slight performance gain could then be observed.
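
For what it's worth, a minimal sketch of issuing that hint from C; the function name and the prefetch distance PFDIST are hypothetical tuning parameters:

    #include <xmmintrin.h>  /* _mm_prefetch / PREFETCHNTA */

    #define PFDIST 256  /* hypothetical prefetch distance, in elements */

    /* Sum a large array, hinting that its lines are non-temporal
     * so they displace as little cached data as possible. */
    double sum_nta(const double *a, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + PFDIST < n)
                _mm_prefetch((const char *)&a[i + PFDIST], _MM_HINT_NTA);
            s += a[i];
        }
        return s;
    }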
