For streaming operations, the intrinsics offer both _mm256_stream_si256 and _mm256_stream_load_si256.
I also found _mm256_stream_ps, but I cannot find its counterpart _mm256_stream_load_ps ... is there any reason for this asymmetry?
Thanks :)
-Roberto
- Tags:
- C/C++
- Development Tools
- General Support
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization
The meaning of "stream" is different for loads and stores -- there is no symmetry in the underlying functionality, so it would be confusing if the intrinsics implied one.
For stores: "streaming" (non-temporal) stores bypass all levels of cache (invalidating the target line if present) and write the data directly to DRAM. These "streaming stores" were originally developed for memory-mapped IO -- to allow a processor to write directly to the graphics frame buffer at relatively high speed. Later, access to this functionality was made available for cached memory by using the "non-temporal" versions of the store instructions.
For loads: "non-temporal" can mean two different things:
- Put the data in the cache, but assume that it will only be used once. The usual way to implement this is to load the data, but bias the "Least-Recently-Used" bits so this newly loaded line will be considered "least recently used" rather than the default "most recently used". The PREFETCHNTA instruction appears to do something like this. PREFETCHNTA also pulls the data into the L1 Data Cache, but does not put the data in the L2 or L3 (except where required for inclusion).
- Don't put the data in the cache -- put it in a separate cacheline-sized buffer to allow multiple contiguous (partial cache line) loads. This is what the MOVNTDQA and VMOVNTDQA instructions are for. They are intended for use with the "Write Combining" (WC) memory type, which is normally used for Memory-Mapped IO, not for system memory. Normally, loads from WC are uncached. By using the *MOVNTDQA loads, the system can move full cache lines from WC memory and service multiple loads from the same line from the buffer.
You should read the description of the MOVNTDQA instruction in Volume 2 of the Intel 64 and IA-32 Architectures Software Developer's Manual -- it says that the instruction applies a non-temporal hint "if WC memory type". It would be possible for Intel to implement this instruction to provide something like PREFETCHNTA semantics when operating on ordinary Write-Back (WB) memory, but the description here suggests that the instruction is only treated specially for WC memory.
Hi Roberto,
As of now, the stream_load intrinsics take the datatypes si128, si256, and si512, with which you can load vectors of the respective sizes.
For _mm256_stream_load_ps support, let me check with the concerned team.
Meanwhile, you can write a loop to combine floats and pass them to _mm256_stream_load_si256.
Regards
Prasanth
Thank you Prasanth. I am wondering if I can get an effect similar to that of the missing _mm256_stream_load_ps by combining prefetching with _MM_HINT_NTA and a regular _mm256_load_ps.
Kind regards
Roberto
Thank you John for your crystal-clear explanation! I had previously read your notes on streaming stores, but I was still confused about the semantics of streaming loads.
I needed point 1 that you mention, and I also found it very useful to read Chapter 8 of the Intel 64 and IA-32 Architectures Optimization Reference Manual.
Hi Roberto,
It looks like your query has been answered; can we close this thread?
-Prasanth
Sure, we can close it. Thank you to all!
-Roberto
Hi Roberto,
Glad to know your query is resolved. We are closing the thread now.
Please raise a new thread for any further queries.
Regards
Prasanth
