
Question about normal store and non-temporal store

New Contributor I

Hi all. I just came across the term "read-modify-write", and I started to wonder whether every write to memory is a kind of "read-modify-write".

Say I allocate a buffer in DRAM and write a bunch of data to it:


int NUM = 1000000000;

int *addr = (int *) malloc(NUM * sizeof(int));

for (int i = 0; i < NUM; i++) {
       *(addr + i) = i;
}


When we write data to each address (addr + i), does the CPU proceed as follows: first read the data at that address from memory into the cache, then modify the data in the cache, and finally write the data back to memory? In other words, is every write a "read-modify-write" if the address is not present in the cache? (I assume there is no need to read if the data is already present in the cache.)


And if I change the write to a non-temporal write:

       _mm_stream_si32(addr+i, i);

will every single write still incur a read that transfers data from memory to the cache?

Many thanks for the help.


New Contributor I

Also, I am wondering whether there is a way to control the cache write policy myself.

Black Belt

The "normal" store operations do perform a cache line read, a modification of one or more bytes of the cache line, and a series of writebacks that eventually push the data all the way back to memory.  In this case the execution of these three components is uncoupled in time. (Execution is ordered, of course, but asynchronous.)

Non-temporal stores have a different flow: the streaming store causes allocation of a cache line buffer to hold the data (not in the cache).  Multiple writes to the same cache line are coalesced into the buffer, until at some point the buffer is flushed.  This typically occurs when all 64 Bytes of the cache line have been written, but the flush can happen at any time.  The flush sends the data to main memory and invalidates the cache line address in any caches in the system.   If all 64 Bytes of the buffer have been written, the memory controller can simply overwrite the value in memory with the new data.  If the buffer has only been partially filled before being flushed, then the memory controller will internally perform a read/modify/write cycle to merge the new data with the original data in the byte positions that have not been modified.  
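To make the "all 64 Bytes written" case concrete, here is a minimal sketch of the loop from the original question rewritten so that every cache line is filled completely by streaming stores. The helper names fill_streaming and alloc_line_aligned and the CACHE_LINE_BYTES constant are made up for this illustration; the 64-byte line size is taken from the description above, and an x86 compiler with SSE2 support is assumed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si32, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_BYTES 64  /* assumed line size, as described above */

/* Hypothetical helper: allocate an int buffer aligned to a cache line.
 * n * sizeof(int) must be a multiple of CACHE_LINE_BYTES. */
static int *alloc_line_aligned(size_t n)
{
    return aligned_alloc(CACHE_LINE_BYTES, n * sizeof(int));
}

/* Fill n ints with their index using non-temporal stores.
 * dst must be 64-byte aligned and n a multiple of 16, so that each
 * cache line is written completely (16 ints x 4 bytes = 64 bytes). */
static void fill_streaming(int *dst, size_t n)
{
    assert(((uintptr_t)dst % CACHE_LINE_BYTES) == 0);
    assert(n % (CACHE_LINE_BYTES / sizeof(int)) == 0);

    for (size_t i = 0; i < n; i++) {
        _mm_stream_si32(dst + i, (int)i);
    }

    /* Order the streaming stores ahead of any later stores. */
    _mm_sfence();
}
```

A caller would allocate with alloc_line_aligned(n), call fill_streaming(buf, n), and release the buffer with free(buf). Whether each buffer is actually flushed only when full is up to the hardware, since the flush can happen at any time; the alignment and length checks only make the full-line case possible.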

The term "read-modify-write" usually refers to an operation with some level of atomicity, such as a locked fetch-and-add instruction.   This is more similar to the "normal" store flow, except that the cache line is "locked" for the duration of the operation so that no other core can access the line until the update is complete.
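For comparison, a minimal sketch of an atomic read-modify-write in the sense used above, written with C11 atomics. The function name fetch_and_add is made up for this illustration, and the instruction the compiler emits is an assumption about typical x86 code generation, not something the standard guarantees.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically add delta to *counter and return the value that was
 * there before the add. No other thread can observe or modify the
 * counter between the read and the write.
 * On x86, compilers typically implement this with a locked
 * instruction such as "lock xadd" (an assumption about code
 * generation, not a language guarantee). */
static int fetch_and_add(atomic_int *counter, int delta)
{
    return atomic_fetch_add(counter, delta);
}
```

Calling fetch_and_add(&c, 5) on a counter holding 0 returns 0 and leaves 5 in the counter. A plain `c += 5` on a non-atomic int performs the same read, modify, and write steps, but without any guarantee that another core cannot touch the line in between.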

New Contributor I

Hi John, thank you for the reply.

I am not familiar with the "cache line buffer" that you refer to. Where is it, and what is its size? What is its relationship to a CPU cache line?


Thank you.

Black Belt

Non-temporal stores use the same functionality as the "write combining buffers" described in section 11.3 (especially 11.3.1) of Volume 3 of the Intel Architectures Software Developer's Manual (Intel document 325384-073, November 2020), but allow that functionality to be accessed in the "WriteBack" mode of cache operation.

The high-level functionality of the non-temporal stores is also mentioned in section 11.5.5.

The size of the write-combining buffer for several different processor generations is listed in Table 11-1.

Intel Architectures Software Developer's Manuals are available via


New Contributor I

Hi John, I am just wondering: does the "read/modify/write" you referred to for non-temporal stores happen in the memory hierarchy, in the CPU cache hierarchy, or in some other CPU-internal buffer hierarchy?

Black Belt

If a write-combining buffer is flushed without being full, the core sends one or more transactions to the memory controller that define the bytes that need to be updated and the values of those bytes.  These transactions are not externally visible, so they are not publicly documented in detail (note 1).  The memory controller (or the DRAM channel controller in the memory controller) performs the read-modify-write cycle(s) (note 2).  The responsibility for ordering the associated coherence transaction (invalidating the cache line in all caches and waiting for confirmation of those invalidations) will likely be shared between the Home Agent and the Memory Controller (but may vary between implementations).


  1. A modest amount of information is available via the op-code filtering features described in the uncore performance monitoring manual for the processor family you are using, but it requires more than a little bit of microarchitecture experience and generally a fair amount of experimentation to understand this material.
  2. The number of RMW cycles needed will depend on the pattern of modified bytes that need to be transferred from the core to the memory controller, the details of the transactions available for carrying the modified bytes from the core to the memory controllers, and (in cases requiring more than a single transaction) on the degree of pressure on the memory controller's internal write buffers.
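The difference between a full-line overwrite and a partial-line merge can be illustrated with a toy software model. This is only a sketch of the idea: the struct wc_buffer, the per-byte mask, and the functions wc_store and wc_flush are all invented for this illustration and do not describe the real (undocumented) transactions or hardware structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

/* Toy model of a write-combining buffer: the data for one cache line
 * plus one "written" bit per byte. Purely illustrative. */
struct wc_buffer {
    uint8_t  data[LINE_BYTES];
    uint64_t byte_mask;  /* bit k set => data[k] has been written */
};

/* Record a store of len bytes at the given offset within the line. */
static void wc_store(struct wc_buffer *wc, unsigned offset,
                     const void *src, unsigned len)
{
    assert(offset + len <= LINE_BYTES);
    memcpy(wc->data + offset, src, len);
    for (unsigned k = 0; k < len; k++) {
        wc->byte_mask |= (uint64_t)1 << (offset + k);
    }
}

/* Flush the buffer into mem_line, which stands in for the line in DRAM.
 * Returns 1 if a read/modify/write merge was needed (partial line),
 * or 0 if the whole line could simply be overwritten. */
static int wc_flush(struct wc_buffer *wc, uint8_t mem_line[LINE_BYTES])
{
    int needs_rmw = (wc->byte_mask != UINT64_MAX);

    if (!needs_rmw) {
        /* All 64 bytes written: overwrite without reading old data. */
        memcpy(mem_line, wc->data, LINE_BYTES);
    } else {
        /* Partial line: keep old bytes wherever the mask bit is clear. */
        for (unsigned k = 0; k < LINE_BYTES; k++) {
            if (wc->byte_mask & ((uint64_t)1 << k)) {
                mem_line[k] = wc->data[k];
            }
        }
    }
    wc->byte_mask = 0;  /* buffer is empty again after the flush */
    return needs_rmw;
}
```

In this model, storing 4 bytes and then flushing takes the merge path and leaves the other 60 bytes of the line unchanged, while storing all 64 bytes before the flush takes the overwrite path. In real hardware the merge is performed by the memory controller, as described above, and may take more than one cycle.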