When a memory region is mapped as Write Combining (for example, using the pgprot_writecombine() API of PAT), are all the writes to that region buffered through one 64-byte WC buffer? In other words, how many WC buffers are mapped to one WC memory region?
My question arises from trying to understand certain InfiniBand behavior. When I open an InfiniBand device, a few pages from the NIC's address space are mmap()ed into userspace. For each page, the kernel calls io_remap_pfn_range() with the page protection set to Write Combining (the return value of pgprot_writecombine()). So each page is a separate WC memory region. Does this mean that all the writes to page_i are buffered through one 64-byte buffer and all the writes to page_j are buffered through another 64-byte buffer?
Another way to ask this question: if two CPUs write 64 bytes to different parts of the same mmap()ed page, will their 64-byte writes to the device essentially be serialized since they are being buffered through one 64-byte WC buffer?
I am aware there are about 10 WC buffers per core, but I am not sure what maps the buffers to the different WC memory regions or how that mapping is done.
Any pointers or thoughts will be more than helpful! Thanks!
The precise number of Write-Combining buffers is not an architectural feature. It is sometimes documented (see the last paragraph for examples) but typically with warnings about not counting on specific values. For example, with HyperThreading enabled and both threads active, the effective number of write-combining buffers available for each thread is halved.
McCalpin's advice: The basic optimization rule is that, whenever possible, code should be structured to write all 64 Bytes to a single aligned 64-Byte region using consecutive instructions before performing any of the writes to another 64-Byte region. Interleaving sub-cache-line stores across multiple target addresses might overflow the number of write-combining buffers available. Even if it does not overflow the number of available write-combining buffers, interleaving across multiple write combining buffers means that all of the buffers will need to be written out in pieces in the event of an interrupt or other event that causes the WC buffers to flush, instead of having a maximum of one partially-filled buffer if they are written one at a time. Partial writes reduce overall efficiency, sometimes significantly, so trying to avoid them helps. It is OK for a core to interleave WC stores across multiple address streams as long as the code writes all 64 aligned Bytes to one address stream before moving to the next.
Each core has its own write-combining buffers, so having two cores writing to different cache line addresses in a WC-mapped MMIO region will not result in any conflicts. If two cores are writing to different addresses within the same cache line, the target device (or target memory controller, for streaming stores to WB memory) will have to perform multiple merge cycles. This will happen sometimes (e.g., if a process is migrated from one core to another in the middle of a set of WC writes), so it does work correctly, but it can be slow enough that software should not do it on purpose.
For older processors, Table 11-1 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384, revision 065) lists values between 4 and 8 entries. Section 11.3.1 of that volume talks a bit more about write-combining, and several sections of the Intel Optimization Reference Manual (document 248966) discuss ways to maximize throughput of software that uses write combining.
I'm not sure we understand what OP is getting at. If 2 CPUs are writing to the same 64-byte memory region, there is definitely a false sharing problem, if not a race condition. I believe it's inefficient for 2 CPUs to write into the same page (4K bytes, unless using huge pages); at the very least, you don't take advantage of the possibility of each core having a distinct active page list and so getting better coverage. Also, you are fighting the design of a NUMA node by forcing remote memory writes. It would not make sense to design an InfiniBand protocol so as to force remote memory usage. If you are writing with nontemporal stores, you may avoid the usual requirement of keeping the write streams several cache lines apart so that prefetched read-for-ownership cache lines are not evicted before you finish updating them (several pages apart, in view of next-page prefetching on recent CPUs).
All Xeon CPUs since Woodcrest have 10 fill buffers per core; those take the place of the WC buffers of older CPU models. If you need to write separate memory streams without taking precautions to avoid partial cache line writes, you should aim not to use more than 8 or 9 fill buffers, so as to give the last filled buffer time to flush before needing to re-use it. It doesn't seem this is the case you are interested in. With AVX512 aligned stores, it may be possible to bypass the fill buffers and avoid write combining entirely, as you accomplish John's suggestion inherently. I'm somewhat surprised not to have seen this discussed. This effect may be necessary to get any advantage from AVX512 when write performance is important. As John pointed out, when multiple fill buffers are needed, they are likely to become a serious potential choke point for hyperthreading.
@John McCalpin: the addresses of the writes within the mapped page are cache-aligned, so the two CPUs are writing 64 bytes to different cache lines. From your third paragraph, it seems like the 64-byte writes of the two CPUs are going into their own WC buffers, so the flush of one buffer shouldn't affect the flush of another. When I run the application, I bind the threads to cores, so no CPU migration occurs.
@Tim P.: The two CPUs are writing 64 bytes to different offsets within the same page, so there is no race. And the offsets within the same page are 256 bytes apart, so there is no false sharing. More importantly, the pages are mapped as Write Combining pages, so the memory is not in the cache hierarchy; the writes to the Write Combining memory are buffered through the WC/fill buffers. The two threads are on the same socket, so no remote NUMA-node memory is accessed. On the machine that I am running on, there is only one HW thread per core, so no hyperthreading issues here.
These writes that I am talking about are the DoorBell + BlueFlame writes, if you are familiar with InfiniBand and Mellanox terminology. The Doorbell write is 64 bits. The BlueFlame write is 64 bytes and it is written using 64-bit atomic writes.
I see a 15% drop in performance when 2 CPUs write 64 bytes to the same page VS when the two CPUs write 64 bytes to different pages. The write to device memory happens in the following pattern:
write32bits() // a 32-bit MMIO write that must precede the following 64-byte write
write64bytes_using_atomic64bit() // the 64-byte BlueFlame write, done as 64-bit stores
From your answers, it seems like WC buffering and flushing is not what causes the drop in throughput when both CPUs call write64bytes_using_atomic64bit() on the same page. I am not sure what is causing the drop then. Any thoughts?
From the processor's perspective, there should be no problem with having multiple cores writing to different cache lines in the same 4KiB page. It should not make any difference whether the writes are to a memory page or an MMIO page.
For system memory accesses, the performance may vary depending on the specific core and the specific address, since the cache coherence is distributed around the chip using an unpublished hash. If you don't control for the location of the cores and the physical addresses being accessed, you can see confusing performance variations.
For MMIO space, I have not tried to work out the message flow on recent Intel processors. It should be possible to track the traffic using the mesh traffic counters, but this is a fairly challenging exercise. (The slides from my April 12 2018 presentation should be up at https://www.ixpug.org/working-groups "Real Soon Now".)
That was a long way of saying that I don't know why you are seeing a performance drop when two cores write to the same MMIO page. There may be hidden conflicts in the specific addresses you are using or in the implementation of SFENCE, or it may be something else entirely...