Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Asymmetry in non-temporal/streaming load/store intrinsics?

roberto_g_2
New Contributor I

For streaming operations intrinsics offer both _mm256_stream_si256 and _mm256_stream_load_si256.

I also found _mm256_stream_ps, but I cannot find its counterpart _mm256_stream_load_ps ... any reason for this asymmetry?

Thanks :)

-Roberto


7 Replies
PrasanthD_intel
Moderator

Hi Roberto,

As of now, the stream_load intrinsics are provided only for the integer vector types: _mm_stream_load_si128, _mm256_stream_load_si256, and _mm512_stream_load_si512, which load vectors of the corresponding sizes.

For  _mm256_stream_load_ps support let me check with the concerned team.

Meanwhile, you can load your float data through _mm256_stream_load_si256 (casting the pointer) and reinterpret the result as packed floats.
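For example, something along these lines (just a sketch; the function name is made up, both pointers are assumed 32-byte aligned, the count a multiple of 8, and the 256-bit VMOVNTDQA form requires AVX2, e.g. -xCORE-AVX2 or -mavx2):

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch: load floats with the integer streaming-load intrinsic and
// reinterpret the bits as packed floats.
void copy_with_stream_load(const float* src, float* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 8) {
        __m256i raw = _mm256_stream_load_si256(
            reinterpret_cast<const __m256i*>(src + i));   // VMOVNTDQA load
        __m256  v   = _mm256_castsi256_ps(raw);           // zero-cost reinterpretation
        _mm256_store_ps(dst + i, v);                      // ordinary aligned store
    }
}
```

_mm256_castsi256_ps only changes the type seen by the compiler; it generates no instructions.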

Regards

Prasanth
roberto_g_2
New Contributor I

Thank you Prasanth. I am wondering if I can get an effect similar to the missing _mm256_stream_load_ps by combining prefetching with _MM_HINT_NTA and a regular _mm256_load_ps.
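Something like the following is what I have in mind -- just a sketch, with the prefetch distance (here 64 floats, i.e. 4 cache lines ahead) picked arbitrarily, the data assumed 32-byte aligned, and the length a multiple of 8:

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch: hint non-temporal use with PREFETCHNTA a few cache lines ahead,
// then use ordinary aligned loads. Prefetching a little past the end of the
// array is harmless, since PREFETCH instructions do not fault.
float sum_once(const float* src, std::size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 8) {
        _mm_prefetch(reinterpret_cast<const char*>(src + i + 64), _MM_HINT_NTA);
        acc = _mm256_add_ps(acc, _mm256_load_ps(src + i));
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);    // spill the 8 partial sums and add them up
    return tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}
```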

Kind regards

Roberto
McCalpinJohn
Honored Contributor III

The meaning of "stream" is different for loads and stores -- the underlying functionality is not symmetric, so it would be confusing if the intrinsics implied that it was.

For stores:  "streaming" (non-temporal) stores bypass all levels of cache (invalidating the target line if present) and write the data directly to DRAM.   These "streaming stores" were originally developed for memory-mapped IO -- to allow a processor to write directly to the graphics frame buffer at relatively high speed.   Later, access to this functionality was made available for cached memory by using the "non-temporal" versions of the store instructions.
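For reference, the store side looks something like this (a minimal sketch; 32-byte alignment and a multiple-of-8 length are assumed, and the sfence is needed to order the non-temporal stores with respect to later ordinary stores):

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch: fill a large 32-byte-aligned buffer with non-temporal stores
// (VMOVNTPS), which bypass the caches, then fence.
void fill_with_streaming_stores(float* dst, float value, std::size_t n)
{
    const __m256 v = _mm256_set1_ps(value);
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_stream_ps(dst + i, v);    // streaming (non-temporal) store
    _mm_sfence();                        // order the NT stores before later stores
}
```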

For loads: "non-temporal" can mean two different things:

  1. Put the data in the cache, but assume that it will only be used once.  The usual way to implement this is to load the data, but bias the "Least-Recently-Used" bits so this newly loaded line will be considered "least recently used" rather than the default "most recently used".  The PREFETCHNTA instruction appears to do something like this.  PREFETCHNTA also pulls the data into the L1 Data Cache, but does not put the data in the L2 or L3 (except where required for inclusion).
  2. Don't put the data in the cache -- put it in a separate cacheline-sized buffer to allow multiple contiguous (partial cache line) loads.   This is what the MOVNTDQA and VMOVNTDQA instructions are for.  They are intended for use with the "Write Combining" (WC) memory type, which is normally used for Memory-Mapped IO, not for system memory.   Normally, loads from WC are uncached.  By using the *MOVNTDQA loads, the system can move full cache lines from WC memory and service multiple loads from the same line from the buffer.

You should read the description of the MOVNTDQA instruction in Volume 2 of the Intel 64 and IA-32 Architectures Software Developer's Manual -- it says that the instruction applies a non-temporal hint "if WC memory type".   It would be possible for Intel to implement this instruction to provide something like PREFETCHNTA semantics when operating on ordinary Write-Back (WB) memory, but the description suggests that the instruction is only treated specially for WC memory.
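As an illustration of that WC usage model (a sketch only -- how the source region gets mapped as write-combining, e.g. a device buffer, is OS/driver specific and not shown; the function name is made up, both pointers are assumed 64-byte aligned, and the byte count a multiple of 64):

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch: pull whole cache lines out of a WC-mapped region with VMOVNTDQA
// and write them to ordinary write-back memory.
void copy_from_wc_region(const void* wc_src, void* dst, std::size_t bytes)
{
    const __m256i* s = static_cast<const __m256i*>(wc_src);
    __m256i*       d = static_cast<__m256i*>(dst);
    for (std::size_t i = 0; i < bytes / 32; i += 2) {
        // Two 32-byte streaming loads cover a full 64-byte line, so both loads
        // can be serviced from the streaming-load buffer after one line fill.
        __m256i lo = _mm256_stream_load_si256(s + i);
        __m256i hi = _mm256_stream_load_si256(s + i + 1);
        _mm256_store_si256(d + i,     lo);
        _mm256_store_si256(d + i + 1, hi);
    }
}
```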

roberto_g_2
New Contributor I

Thank you John for your crystal-clear explanation! I had previously read your notes on streaming stores, but I was still confused about the semantics of streaming loads.

I needed point 1 that you mention, and I also found Chapter 8 of the Intel 64 and IA-32 Architectures Optimization Reference Manual very useful to read.

PrasanthD_intel
Moderator

Hi Roberto,

It looks like your query has been answered; can we close this thread?

-Prasanth
roberto_g_2
New Contributor I

Sure, we can close it. Thank you all.

-Roberto

PrasanthD_intel
Moderator

Hi Roberto,

Glad to know your query is resolved. We are closing the thread now.

Please raise a new thread for any further queries.

Regards

Prasanth
