When the automatic hardware prefetcher loads data from memory into the cache, does it mark the loaded cache lines as modified? If clflush is then executed on an address within such a cache line, or wbinvd is executed, are the prefetched lines written back to memory or not?
I would assume that the prefetcher shouldn't mark the lines as modified, and consequently clflush shouldn't write them back to memory. However, a problem I'm seeing when trying to flush some prefetched device memory suggests otherwise.
As discussed in the Intel Optimization Reference Manual, recent Intel processors have 4 or 5 hardware prefetchers that can operate on data (rather than instructions). The details are not published, but most have been investigated at varying levels of detail...
Intel's descriptions of the L1 HW prefetchers don't mention the possibility of prefetching for stores -- only loads. Prefetches for load streams should follow the usual rule of installing the line in Exclusive state if there are no other sharers, and in Shared state if there are other sharers. A CLFLUSH on an address that is in E or S state in a local cache will typically have to notify the L3 and/or the Snoop Filter that the line has been invalidated, but no data transfers should be involved -- the line should be invalidated without needing to be written back.
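The rule in the paragraph above can be written down as a toy model. This is only a sketch of the reasoning, not a description of any real hardware interface: the names line_state, model_load_prefetch_fill and model_clflush are made up for illustration, and the model assumes plain MESI with no implementation-specific shortcuts.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified MESI states for one cache line (illustrative names only). */
typedef enum { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE, LINE_MODIFIED } line_state;

/* Outcome of a modeled CLFLUSH on one line. */
typedef struct {
    bool wrote_back;       /* true if data had to be sent to memory */
    line_state new_state;  /* state of the line after the flush */
} flush_result;

/* Model of a load-stream prefetch fill: Exclusive if there are no other
 * sharers, Shared otherwise. The model never installs a line as Modified. */
line_state model_load_prefetch_fill(bool other_sharers)
{
    return other_sharers ? LINE_SHARED : LINE_EXCLUSIVE;
}

/* Model of CLFLUSH: only a Modified line holds data that memory does not
 * have, so only a Modified line is written back before being invalidated. */
flush_result model_clflush(line_state s)
{
    assert(s <= LINE_MODIFIED);
    flush_result r = { .wrote_back = (s == LINE_MODIFIED), .new_state = LINE_INVALID };
    return r;
}
```

Under this model, model_clflush(model_load_prefetch_fill(false)) reports no write-back. A write-back would only appear if something had installed the line as Modified.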
There are descriptions of hardware RFO (Read For Ownership) prefetches for the L2/L3 prefetchers in the "Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual". I have seen systems that have implemented prefetch for store in such a way that the lines are always installed in Modified state in the cache. (This seems like a poor choice, but I have definitely seen it.) I have never tried to measure this on Intel processors.
The operation of CLFLUSH on MMIO device memory is a potential minefield, especially with inclusive caches and/or snoop filters. The WB (WriteBack) type is specifically prohibited on every x86 system that I have worked with -- Store transactions are converted into PCIe writes, but WriteBack transactions are not (and cause the system to hang). Device memory can be mapped as WT (WriteThrough) or WP (WriteProtect). Both of these modes write through to memory, but WT will update a store target in the local cache if the line is already present, while WP will invalidate all store targets in all caches. Reads are cacheable and should not cause problems. The HW prefetchers have access to the MTRRs (so that they won't prefetch in uncacheable ranges), but I don't know if this inhibits or modifies their behavior for WT and WP memory types.
Lots of questions raised, and few answers are available. What is your memory type in this case? How does the problem present itself?
Thank you for the details. In my case these are loads that are prefetched. The memory is local DDR memory on a PCIe device (memory-mapped to the CPU via a 64-bit prefetchable BAR).
The problem is as follows.
The device performs some operation on a device memory buffer which is also mapped to the CPU and marked as cached. On the CPU we have code running that waits for the device to finish writing to the buffer. It then issues clflush over the whole range touched by the device, expecting this to invalidate any cache entries that the prefetcher might have read (which could potentially contain partial results). With that done, in some cases we still get partial results, which suggests that either clflush didn't work for some reason, or the prefetcher loaded those lines as modified and clflush then wrote the partial results back to memory.
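For reference, the flush step described above is usually written along the following lines. This is a minimal user-space sketch, not the poster's actual code: flush_range and CACHE_LINE_BYTES are hypothetical names, the 64-byte line size is an assumption (CPUID reports the real CLFLUSH line size), and the fences on both sides are the common conservative idiom rather than something taken from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2, x86 only) */

/* Assumed cache line size in bytes; must be a power of two. */
#define CACHE_LINE_BYTES 64u

/* Issue CLFLUSH for every cache line overlapping [addr, addr + len).
 * Returns the number of lines flushed. */
size_t flush_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    if (len == 0)
        return 0;

    /* Round the start down to a line boundary so a partial first line is covered. */
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines  = 0;

    _mm_mfence();                       /* order earlier accesses before the flushes */
    for (; p < end; p += CACHE_LINE_BYTES) {
        _mm_clflush((const void *)p);   /* invalidate (and write back if dirty) one line */
        lines++;
    }
    _mm_mfence();                       /* order the flushes before later loads */
    return lines;
}
```

The sketch only shows the mechanics of the flush. It does not say anything about the state in which the prefetcher left the lines, which is the open question here.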
Your SW architecture sounds very similar to what I used to do with AMD processors, and which I am told has been used successfully with Intel processors as well (though I have not heard any recent updates in this area).
When you say that the range is "marked as cached", exactly how was this done? (What OS revision? What driver calls?). The details of the behavior depend on both the MTRR and PAT settings for the region, and it is not very much fun trying to figure out what the OS has actually done with these settings.
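One small piece of that puzzle can be checked from user space on Linux: /proc/mtrr lists the variable-range MTRRs the kernel has programmed. The sketch below parses one line of that file. The line format is assumed from typical /proc/mtrr output (for example "reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back"), and the names mtrr_entry, parse_mtrr_line and mtrr_line_is_wc are hypothetical. It tells you nothing about the PAT side of the mapping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed /proc/mtrr line (illustrative structure, not a kernel type). */
typedef struct {
    int reg;                   /* register index, e.g. 0 for "reg00" */
    unsigned long long base;   /* physical base address */
    unsigned long size;        /* size, in units given by `unit` */
    char unit;                 /* unit letter as printed, e.g. 'K' or 'M' */
    int count;                 /* usage count */
    char type[32];             /* e.g. "write-back", "write-combining" */
} mtrr_entry;

/* Parse one line in the assumed /proc/mtrr format.
 * Returns 0 on success, -1 if the line does not match. */
int parse_mtrr_line(const char *line, mtrr_entry *out)
{
    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    int n = sscanf(line,
                   "reg%d: base=0x%llx (%*[^)]), size=%lu%cB, count=%d: %31s",
                   &out->reg, &out->base, &out->size, &out->unit,
                   &out->count, out->type);
    return n == 6 ? 0 : -1;
}

/* Returns 1 if the line describes a write-combining range, 0 otherwise. */
int mtrr_line_is_wc(const char *line)
{
    mtrr_entry e;
    return parse_mtrr_line(line, &e) == 0 &&
           strcmp(e.type, "write-combining") == 0;
}
```

Feeding each line of /proc/mtrr through parse_mtrr_line and comparing base and size against the BAR address shows whether a WC MTRR covers the device memory at all.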
There is an interesting article on MMIO mapping across different systems at https://lwn.net/Articles/698014/, but it does not mention cached MMIO at all. (This is not surprising -- cached MMIO is very rare because it is so hard to get it right.)
A very useful (but old) description of what the Linux kernel means by "cacheable" in the context of MMIO regions is at http://lkml.iu.edu/hypermail/linux/kernel/0804.3/2911.html. Most importantly, "cacheable" in this context does not mean that anything will actually be placed in the caches -- it is mostly about speculation, ordering, and access size. Interestingly, the note does say:
"The CPU is allowed to write the contents from its cache back to memory at any point in time, even if the program will never actually write to the cacheline; the later is the result of speculation etc; what will be written in that case is the clean cacheline that was in the cache. (AMD cpus seem to do this relatively aggressively; Intel cpus may or may not do this)"
I have not seen this myself, but it is consistent with what I know about the philosophy of processor architects --- they don't want any restrictions on their ability to use every degree of freedom allowed by the architectural specification for a memory type.
A potentially useful paper by some folks at HPE shows what they did to get WT mode working for experiments with NVRAM: https://www.hpl.hp.com/techreports/2012/HPL-2012-236.pdf
This is on a recent Linux kernel. The mapping is done via remap_pfn_range() with the vma's vm_page_prot not set to uncached (i.e. pgprot_noncached() was not applied to vm_page_prot). As to MTRRs, arch_phys_wc_add() is used for the whole range of device memory, making it write-combining (which I think shouldn't result in anything when PAT is present?).
Is there any way to monitor prefetching? I would actually like to somehow confirm this is the case instead of just guessing.
The easiest thing to do is disable the HW prefetchers. This is described at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
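As a rough sketch of what that article describes: on the processor families it covers, MSR 0x1A4 has one disable bit per prefetcher (bit 0 L2 hardware prefetcher, bit 1 L2 adjacent cache line prefetcher, bit 2 DCU prefetcher, bit 3 DCU IP prefetcher), and setting a bit disables that prefetcher. The code below assumes Linux with the msr driver loaded and root privileges. The function and macro names are hypothetical, and the MSR layout should be checked against the article for the specific processor model before use.

```c
#define _XOPEN_SOURCE 700   /* for pread/pwrite declarations */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_PREFETCH_CONTROL 0x1A4   /* per the Intel article; model-specific */
#define PF_L2_HW        (1u << 0)    /* L2 hardware prefetcher */
#define PF_L2_ADJACENT  (1u << 1)    /* L2 adjacent cache line prefetcher */
#define PF_DCU          (1u << 2)    /* DCU prefetcher */
#define PF_DCU_IP       (1u << 3)    /* DCU IP prefetcher */
#define PF_ALL          0xFu

/* Compute the new MSR value: set the requested disable bits and keep
 * every other bit of the current value unchanged. */
uint64_t prefetch_disable_value(uint64_t current, unsigned mask)
{
    assert((mask & ~PF_ALL) == 0);
    return current | mask;
}

/* Read-modify-write the MSR on one logical CPU via /dev/cpu/N/msr.
 * Returns 0 on success, -1 on any failure (no driver, no permission, ...). */
int disable_prefetchers(int cpu, unsigned mask)
{
    char path[64];
    uint64_t val;
    int rc = -1;

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* The msr driver uses the file offset as the MSR number. */
    if (pread(fd, &val, sizeof val, MSR_PREFETCH_CONTROL) == (ssize_t)sizeof val) {
        val = prefetch_disable_value(val, mask);
        if (pwrite(fd, &val, sizeof val, MSR_PREFETCH_CONTROL) == (ssize_t)sizeof val)
            rc = 0;
    }
    close(fd);
    return rc;
}
```

Calling disable_prefetchers(cpu, PF_ALL) for the CPUs that run the polling code, then rerunning the test, would show whether the partial results go away with the prefetchers off.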
From Table 11-7 of Volume 3 of the Intel Software Developer's Manual (SDM), if the MTRR type is WC, the "effective memory type" (dependent on the PAT) can only be WC or UC, so these addresses will never be placed in a processor cache. Footnote 2 notes that with MTRR type WC and a PAT type that results in an "effective memory type" of UC (UC, WT, WP), the processor caches will be snooped because of possible page aliasing. With MTRR type UC, no caching is possible, so processors are not required to snoop their caches. This might be an interesting experiment from a correctness standpoint?
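The MTRR-type-WC case described above can be captured in a small lookup. This sketch encodes only what the paragraph states (PAT types UC, WT and WP give an effective type of UC, and the remaining PAT types give WC). The enum and function names are made up for illustration, and the table in the SDM remains the authority.

```c
#include <assert.h>
#include <stdbool.h>

/* Memory types as used by the MTRRs and the PAT (illustrative enum). */
typedef enum { MT_UC, MT_UC_MINUS, MT_WC, MT_WT, MT_WB, MT_WP } mem_type;

/* Effective memory type when the MTRR type for the range is WC.
 * PAT UC, WT and WP resolve to UC; every other PAT type resolves to WC. */
mem_type effective_type_mtrr_wc(mem_type pat)
{
    assert(pat <= MT_WP);
    switch (pat) {
    case MT_UC:
    case MT_WT:
    case MT_WP:
        return MT_UC;
    default:
        return MT_WC;
    }
}

/* Whether lines of a given effective type may be placed in a processor
 * cache: true for the cacheable types, false for UC, UC- and WC. */
bool may_allocate_in_cache(mem_type effective)
{
    return effective == MT_WT || effective == MT_WB || effective == MT_WP;
}
```

For example, may_allocate_in_cache(effective_type_mtrr_wc(MT_WB)) is false: if a WC MTRR really covered the device memory, a WB page mapping could not end up in the cache. Whether that MTRR exists on a PAT-enabled kernel is a separate question.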