I have a question about memory ordering instructions that I hope you can help me with. I am reading the Intel 64 and IA-32 Architectures Software Developer's Manual, and the entry on the MFENCE instruction says: "This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible."
Can you be more clear about the definition of "program order"? Does this mean instructions with lower or higher memory addresses relative to the mfence instruction, or does this mean the set of all load store instructions that have been executed previously in time on this processor?
I know it seems like a dumb question, but let me give you some background to help you understand why I'm bothering to ask you. I have the following scenario: my application has one thread supplying data, as many threads as there are processors processing data (the operation is quite computationally intensive) and another thread writing data to a disk. I observed a mysterious and very difficult to reproduce error that I attributed to memory ordering. The behavior was consistent with one of my processing threads processing uninitialized buffer contents. Anyway, here (finally) is the point of my asking the original question: my data reading/supplying thread calls a function that takes a pointer and fills a buffer. Provided of course that I have done a thorough job of thread synchronization aside from this, does it suffice to place an mfence instruction after this function call to ensure that the buffer contents are reflected in main memory, or do I need to modify the buffer-filling function itself to include the mfence instruction?
Any feedback would be greatly appreciated. I really want to be clear since this problem is so very difficult to reproduce. It is hard for me to tell whether I've fixed it or not, since even an extensive test *before* the attempted fix was applied did not reproduce the behavior.
Ok... that's it. Thank you so very much in advance for your time helping me and reading this long message.
Oh- and just for your reference, I am testing/developing on my Core 2 Duo T5600 processor.
Have you had a chance yet to look at volume 3A: System Programming Guide, Vol. 1?
The section on Memory Ordering (in my copy, section 7.2) explains how it works in the several processor generations.
Another response relayed from a different engineer:
> Can you be more clear about the definition of "program order"? Does this mean instructions with lower or higher memory addresses relative to the mfence instruction, or does this mean the set of all load store instructions that have been executed previously in time on this processor?

Program order in this case is the time order of accesses as performed by one thread of execution (i.e., as performed by one processor).
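To make that concrete, here is a minimal C++11 sketch of the supplier-side hand-off, not anyone's actual code. `fill_buffer`, `run_handoff`, and the `ready` flag are hypothetical names invented for illustration. It assumes a full fence from `std::atomic_thread_fence(std::memory_order_seq_cst)` plays the role of MFENCE (x86 compilers typically implement it with MFENCE or a locked instruction). Because program order is the time order of one thread's accesses, the stores made inside `fill_buffer` already precede a fence placed after the call, so the function itself does not need to be modified.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the buffer-filling function.
// Note that it contains no fence of its own.
void fill_buffer(int* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = static_cast<int>(i) + 1;
}

// Fills a buffer on the calling thread, hands it to a consumer thread,
// and returns the sum of the values the consumer observed.
long run_handoff(std::size_t n) {
    std::vector<int> buffer(n, 0);
    std::atomic<bool> ready{false};
    long observed = 0;

    std::thread consumer([&] {
        // Wait until the supplier has published the buffer.
        while (!ready.load(std::memory_order_relaxed)) std::this_thread::yield();
        // Fence on the reading side: the buffer loads below may not be
        // reordered ahead of the flag load above.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        for (int v : buffer) observed += v;
    });

    fill_buffer(buffer.data(), buffer.size());
    // Fence placed AFTER the call: every store issued inside fill_buffer
    // comes earlier in this thread's program order, so it is ordered
    // before the flag store below.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    ready.store(true, std::memory_order_relaxed);

    consumer.join();
    return observed;
}
```

The fence only helps if the consumer actually waits on the flag. A fence on the supplier side does nothing for a thread that reads the buffer without checking `ready` first.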
Can I make an update/clarification to my original question? I have had some more thoughts about my problem and I've done some more reading and research.
I've been reading the Intel 64 Architecture Memory Ordering White Paper (document # 318147-001, August 2007) and I believe that I have run into undesirable behavior that essentially is the content of section 2.4, entitled "Intra-processor forwarding is allowed". I have the following scenario: one thread provides data to buffers, then as many threads as there are processors process this data (== 2 on my machine, a T5600 Core 2 Duo) and this process is quite computationally intensive. Finally, another thread writes the processed data out to a disk.
Anyway, I believe what happens is that once in a while, processed data is stuck in the L1 cache of the processor that is not running the writer thread, and then when the writer thread copies this buffer, it is not up to date. Or perhaps it is just speculative reading that has mis-ordered memory? Anyway, can anyone give me guidelines on the best way to fix this? I have a number of solutions in mind, but I am having a little trouble with ambiguity in the documentation. For example, according to the Intel 64 and IA-32 Architectures Software Developer's Manual, the MFENCE instruction orders both stores and loads. The manual says "This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible." But when two processors are running in parallel I do not know what "program order" means. Does MFENCE serialize only for the processor that it is executed on, or does it do more and guarantee that all stores and loads *from all processors* are globally visible before subsequent loads and stores? The question is, is MFENCE enough to fix this problem, or do I need to use CLFLUSH, surrounded by MFENCE instructions? Or is there even a solution at all when the buffers are of memory type WB? Any information would be greatly appreciated. Thanks in advance.
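For what it's worth, here is a hedged sketch of the worker-to-writer hand-off in C++11, assuming ordinary WB memory. On x86 the caches are kept coherent by hardware, so data sitting in another core's L1 is still visible to a load from the writer's core, and CLFLUSH should not be needed for that. What does need enforcing is the ordering between the data and the "done" flag, in both the compiler and the processor. The `Slot` type, `process`, `run_pipeline`, and the state constants below are hypothetical names, not taken from the original application. The sketch uses a release store and an acquire load in place of hand-placed fences.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-buffer states, for illustration only.
constexpr int kFilled = 0;
constexpr int kProcessed = 1;

struct Slot {
    std::vector<int> data;
    std::atomic<int> state{kFilled};
};

// Hypothetical stand-in for the computationally intensive step.
void process(std::vector<int>& d) {
    for (int& v : d) v *= 2;
}

// Runs one worker thread per slot; the calling thread acts as the writer
// and returns the sum of everything it copied out.
long run_pipeline(std::size_t workers, std::size_t n) {
    std::vector<Slot> slots(workers);
    for (auto& s : slots) s.data.assign(n, 1);  // supplier step, done before the threads start

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&slots, w] {
            process(slots[w].data);
            // Release store: all writes to data are ordered before the flag.
            slots[w].state.store(kProcessed, std::memory_order_release);
        });
    }

    long total = 0;
    for (auto& s : slots) {
        // Acquire load: once kProcessed is seen, the processed data is too.
        while (s.state.load(std::memory_order_acquire) != kProcessed)
            std::this_thread::yield();
        for (int v : s.data) total += v;  // stands in for the copy to disk
    }
    for (auto& t : pool) t.join();
    return total;
}
```

If the writer could ever read a buffer without first observing its flag, no fence or cache flush on the worker side would help. In that case the fix is in the synchronization logic, not in the memory-ordering instructions.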