Solved: Re: How does CPU determine when to do a write action in the case of write-combining?

zhed · ‎10-29-2020

Hi,

I have a question about write-combining. With a write-combining mapping (set up with remap_pfn_range and pgprot_writecombine), the CPU may buffer several writes that fall within the same cache-line-aligned 64 bytes and then perform a single real write. How does the CPU decide when to do the real write? Does it do so after it receives a write to some other 64-byte line, different from the one it has been buffering? That is only my guess. Could any expert answer my question?

Thanks.

Zhe

McCalpinJohn · ‎10-30-2020

The details are not fully disclosed, though some hints are given.

Serializing Instructions: Intel SWDM (Software Developer's Manual) Volume 3, Section 8.3

"These instructions force the processor to complete all modifications to flags, registers, and memory by previous instructions and to drain all buffered writes to memory before the next instruction is fetched and executed." (emphasis added)
There are only 3 serializing instructions available in user mode: CPUID, IRET, RSM. Of these, only CPUID can be used whenever you want. The CPUID instruction may require a few hundred cycles to execute, so it is not an efficient way of flushing the WC buffers, but it is guaranteed to work.
There are many more serializing instructions in kernel mode -- most of which are also expensive -- but the return from kernel mode (IRET) is serializing, so any kernel call that is not implemented as a VDSO will result in flushing the WC buffers before returning to user space.
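
As a rough illustration of the CPUID approach, here is a minimal C sketch, assuming GCC or Clang on x86-64. The helper name serialize_with_cpuid is made up for this example. It executes CPUID leaf 0 through inline assembly, with a "memory" clobber so that the compiler does not move stores across it.

```c
#include <stdint.h>

/* Hypothetical helper (name invented for this sketch).
 * Executes CPUID leaf 0, which is a serializing instruction, and returns
 * the highest supported basic CPUID leaf so the result is not discarded. */
static inline uint32_t serialize_with_cpuid(void)
{
    uint32_t eax = 0;   /* input: leaf 0 */
    uint32_t ecx = 0;   /* input: sub-leaf 0 (ignored by leaf 0) */
    uint32_t ebx, edx;  /* outputs only */

    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory"); /* compiler barrier for earlier stores */
    (void)ebx;
    (void)ecx;
    (void)edx;
    return eax;
}
```

The idea would be to call serialize_with_cpuid() once after the last store to the WC region. Given the cost of a few hundred cycles, this probably only makes sense at coarse boundaries, not after every line.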

Buffering of Write-Combining Memory Locations: SWDM Volume 3, Section 11.3.1

"The size and structure of the WC buffer is not architecturally defined. For the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4 and Intel Xeon processors; the WC buffer is made up of several 64-byte WC buffers."
- This says that there are only a small number of WC buffers, so after you have stored something into each of them, any subsequent write to a different cache line in the WC region will require that at least one WC buffer be flushed.
- There is no guarantee that you can combine writes into more than one cache line at a time. Intel recommends writing to one cache line at a time and placing the store instructions for the 64 Bytes of the line as close together as possible in the code.
- The effective number of write-combining buffers can differ across products, or across microcode revisions for the same processor.
"When software begins writing to WC memory, the processor begins filling the WC buffers one at a time. When one or more WC buffers has been filled, the processor has the option of evicting the buffers to system memory."
- The wording here is unusual, but intentional. Without going into implementation details or providing guarantees, Intel is saying that the "normal" response to filling a WC buffer is to evict it to memory. This is not guaranteed to happen immediately or synchronously, but it is the recommended approach to minimize the average latency between the store instructions in the core and the subsequent eviction of the line to memory.
"The protocol for evicting the WC buffers is implementation dependent and should not be relied on by software for system memory coherency."
- The wording here is a little scary. It does not mean that software has no control over evicting the WC buffers, but it does mean that the control is indirect. In fact, software is required to follow certain rules to ensure the desired memory ordering constraints.
- For the STREAM benchmark, when the Intel compiler generates streaming (write-combining) stores, an MFENCE instruction is inserted immediately after the loop containing the WC stores. This may or may not "force" eviction of the WC buffers, but it does ensure that the processor will not continue executing instructions until the WC buffers have been evicted.

So the processor can flush the WC buffers whenever it wants and you can't prevent that. You can minimize the average latency by writing full buffers, and minimize the chance of undesired flushes by placing the store instructions as close together as possible. (Using "wider" store instructions reduces the occurrence of undesired flushes due to interrupts, with my testing showing zero undesired flushes when using 512-bit AVX512 streaming stores.)
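
The advice above (fill one 64-byte line at a time, keep the stores for that line adjacent, fence after the loop) might look like the following C sketch. The assumptions are mine: x86-64 with SSE2, GCC or Clang, and a helper name, wc_copy_lines, invented for this example. It uses 128-bit streaming stores, which go through the WC buffers even on ordinary memory; on a region that is already mapped WC, ordinary stores would be combined as well. The compiler is still free to schedule the instructions differently, so the generated code is worth checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si128, _mm_mfence */

#define WC_LINE_BYTES 64  /* WC buffer size quoted above for the listed CPUs */

/* Hypothetical helper: copy n_lines full 64-byte lines using streaming stores.
 * dst must be 64-byte aligned; src may be unaligned. */
static void wc_copy_lines(void *dst, const void *src, size_t n_lines)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    assert(((uintptr_t)dst % WC_LINE_BYTES) == 0);

    for (size_t i = 0; i < n_lines; i++, d += 4, s += 4) {
        /* Do all four loads first so the four stores below are adjacent. */
        __m128i q0 = _mm_loadu_si128(s + 0);
        __m128i q1 = _mm_loadu_si128(s + 1);
        __m128i q2 = _mm_loadu_si128(s + 2);
        __m128i q3 = _mm_loadu_si128(s + 3);

        /* Four 16-byte streaming stores cover exactly one 64-byte line. */
        _mm_stream_si128(d + 0, q0);
        _mm_stream_si128(d + 1, q1);
        _mm_stream_si128(d + 2, q2);
        _mm_stream_si128(d + 3, q3);
    }

    /* Fence after the loop, as the Intel compiler does for STREAM. */
    _mm_mfence();
}
```

With AVX-512 available, a single 512-bit streaming store per line would replace the four 128-bit stores, which matches the "wider stores" observation above.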

zhed · ‎10-30-2020

Hi, Dr. Bandwidth
Thank you for sharing. Very helpful!
Thanks
Zhe