I'm trying to write a fast function for filling a large buffer with a 128-bit vector value. I'm using movntps and I was wondering if a fence instruction is necessary for correctness and/or performance. Things appear to work fine without it, but I wonder if that's just dumb luck or if the processor detects the lack of it and ensures correctness through some kind of costly interrupt and/or microcode?
If a fence is highly recommended, should I use sfence or mfence? I couldn't find any documents with straight answers.
The fence instruction is only needed to be absolutely sure that all of the non-temporal stores are visible before a subsequent "ordinary" store. The most obvious case where this matters is in parallel code, where the "barrier" at the end of a parallel region may include an "ordinary" store. Without a fence, the processor might still have modified data in the write-combining buffers when it passes through the barrier, allowing other processors to read "stale" copies of the write-combined data. This scenario might also apply to a single thread that the OS migrates from one core to another (not sure about this case).
I can't remember the detailed reasoning (not enough coffee yet this morning), but the instruction you want to use after the non-temporal stores is MFENCE. According to Section 8.2.5 of Volume 3 of the Intel SW Developer's Manual (SWDM), MFENCE is the only fence instruction that prevents both subsequent loads and subsequent stores from being executed ahead of the completion of the fence. I am surprised that this is not mentioned in Section 11.3.1, which tells you how important it is to manually ensure coherence when using write-combining, but does not tell you how to do it!
It is quite common for codes to execute correctly in the overwhelming majority of cases even without the proper fencing -- especially if the data written is not referenced for a long time after the stores. (This is, not surprisingly, a primary reason for pushing the data past the caches in the first place!) But it is a very good idea to include the proper fences -- otherwise you will likely have no recollection of the potential correctness exposure when you eventually get bitten by it.
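To make the barrier scenario concrete, here is a minimal sketch in C with SSE intrinsics. The names publish_fill and ready are hypothetical, dst is assumed to be 16-byte aligned, and the fence is MFENCE as suggested above (_mm_mfence, which needs SSE2). Treat it as an illustration under those assumptions rather than a tuned implementation.

```c
#include <emmintrin.h>   /* SSE2: _mm_mfence */
#include <stdatomic.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE: _mm_stream_ps */

/* Hypothetical helper: fill `count` 16-byte vectors at `dst` with `value`
 * using non-temporal stores, then publish a "ready" flag for other threads.
 * Assumption: `dst` is 16-byte aligned (required by MOVNTPS). */
static void publish_fill(float *dst, __m128 value, size_t count,
                         atomic_int *ready)
{
    for (size_t i = 0; i < count; ++i)
        _mm_stream_ps(dst + 4 * i, value);   /* MOVNTPS, write-combining */

    /* Fence so the non-temporal stores are globally visible before the
     * ordinary store below; without it the flag could become visible
     * while data still sits in the write-combining buffers. */
    _mm_mfence();

    /* The "ordinary" store, standing in for the end-of-region barrier. */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

A consumer thread would wait until ready reads as 1 before touching dst.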
Thanks John! I'm still a little confused about requiring mfence, which is an SSE2 instruction, while movntps is SSE1. Is this simply because there were no multi-core / hyper-threaded CPUs with only SSE1? It would mean I need two code paths if I wanted to support such old processors...
As discussed in Section 8.2.5 of Volume 3 of the SW Developer's Manual, the three FENCE instructions were added as more lightweight ways of controlling memory ordering than using a fully serializing instruction such as CPUID. I get the impression that CPUID was the recommended way to enforce ordering before the fence instructions became available. The effects of the serializing instructions (most of which are privileged) are discussed in Section 8.3 of SWDM-Vol3.
When I was writing a memset years ago, I used non-temporal stores. The folks who were more expert in the details of instruction requirements insisted I use an sfence instruction after the non-temporal stores.
For instance, I used movntdq to initialize memory. The SDM volume 2 writeup for the movntdq instruction says:
"Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTDQ instructions if multiple processors might use different memory types to read/write the destination memory locations"
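As a rough sketch of what that memset looked like in spirit (not the original code), here is a version using the intrinsics for MOVNTDQ and SFENCE. The name memset_nt is hypothetical, and the sketch assumes dst is 16-byte aligned; a real implementation would also handle an unaligned head.

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi8 */
#include <stddef.h>
#include <xmmintrin.h>   /* SSE: _mm_sfence */

/* Hypothetical memset-style helper using non-temporal stores.
 * Assumption: `dst` is 16-byte aligned (required by MOVNTDQ). */
static void memset_nt(void *dst, int byte, size_t n)
{
    __m128i v = _mm_set1_epi8((char)byte);
    __m128i *p = (__m128i *)dst;
    size_t blocks = n / 16;

    for (size_t i = 0; i < blocks; ++i)
        _mm_stream_si128(p + i, v);          /* MOVNTDQ */

    /* Any remaining bytes (fewer than 16) use ordinary stores. */
    unsigned char *tail = (unsigned char *)(p + blocks);
    for (size_t i = 0; i < n % 16; ++i)
        tail[i] = (unsigned char)byte;

    /* SFENCE orders the non-temporal stores ahead of any later stores. */
    _mm_sfence();
}
```

The fence is issued once after the loop rather than per store, so the stores can still be combined in the write-combining buffers.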
Hope this helps,