I am trying to understand the Intel memory model so that I can write some multithreaded code. My aim is to copy some data into a buffer and then set an index so that another thread can access the data. The data is copied using a rep movs instruction. I have tried using an xchg instruction to store the index value.
mov edi,edx
mov ecx,0x12
rep movs DWORD PTR es:[edi],DWORD PTR ds:[esi]
xchg DWORD PTR [ebx+0xdc],eax
Alternatively, I have tried using a mov followed by a lock command.
mov edi,edx
mov ecx,0x12
rep movs DWORD PTR es:[edi],DWORD PTR ds:[esi]
mov DWORD PTR [ebx+0xdc],eax
lock or DWORD PTR [esp],0x0
From my testing the xchg version does not seem to work as I hoped, but the mov and lock version does. From reading the Intel® 64 and IA-32 Architectures Software Developer’s Manual it would seem they should be equivalent. Is there a subtle difference between the xchg and a mov and lock methods?
From examples 8-13 and 8-14, the string movs instructions are not reordered with other stores, and the subsections of 8.2.3 on locked instructions state that the lock and xchg instructions cannot be reordered with other stores. Therefore, I assume the processor cannot reorder the instructions in the code examples above.
However, what is not clear is what another processor may see.
From the section “8.1 Lock operations” “Because frequently used memory locations are often cached in a processor’s L1 or L2 caches, atomic operations can often be carried out inside a processor’s caches without asserting the bus lock. Here the processor’s cache coherency protocols ensure that other processors that are caching the same memory locations are managed properly while atomic operations are performed on cached memory locations.”
Also from “8.1.2.2 Software Controlled Bus Locking” “locked operations serialize all outstanding load and store operations (that is, wait for them to complete)”
From “8.2.5 Strengthening or Weakening the Memory-Ordering Model” “Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory”
From “8.3 SERIALIZING INSTRUCTIONS” “The processor does not write back the contents of modified data in its data cache to external memory when it serializes instruction execution.”
Sections 8.2.5 and 8.3 seem to contradict each other?
Do the above statements mean that the lock operations (either lock or xchg) ensure all previous instructions are complete, but do not ensure their effects will be visible to another processor? If this is the case, how can I ensure the data in my buffer will be visible to another processor when the index is changed?
The wording in 8.2.5 is a little misleading. The phrase "all buffered writes to drain to memory" should be interpreted as "the contents of all store buffers are committed to the cache (and are therefore globally visible)".
The wording in 8.3 about writing modified data from cache to memory is not relevant to ordering. Ordering is determined only by "visibility", not by the location of the data in the cache hierarchy.
My reading of Chapter 8 is that your first example should work. The XCHG instruction is implicitly locked, and example 8-13 says clearly that other processors should not be able to see the results of the store in the XCHG operation until after all of the stores implied by the REP MOVS instruction have become globally visible.
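For anyone expressing the same idea in C++ rather than raw assembly, the pattern in the first example can be sketched roughly as follows. This is only an illustration under stated assumptions: `Channel`, `publish`, and `try_consume` are hypothetical names, the 18-word buffer mirrors the 0x12 count in the question, and the remark that a seq_cst store becomes xchg (or mov followed by mfence) describes what x86 compilers typically emit, so it is worth checking the generated assembly.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical names; 18 dwords matches the 0x12 count in the question.
constexpr std::size_t kWords = 18;

struct Channel {
    std::uint32_t buffer[kWords] = {};     // plain data, written before the index
    std::atomic<std::uint32_t> index{0};   // plays the role of [ebx+0xdc]
};

// Producer: copy the payload first, then publish the new index.
// A seq_cst store is typically compiled to xchg (or mov + mfence) on x86.
inline void publish(Channel& ch, const std::uint32_t* src, std::uint32_t new_index) {
    assert(src != nullptr);
    std::memcpy(ch.buffer, src, sizeof ch.buffer);
    ch.index.store(new_index, std::memory_order_seq_cst);
}

// Consumer: copies the payload only if the expected index has been published.
// The acquire load keeps the buffer reads from moving ahead of the index check.
inline bool try_consume(Channel& ch, std::uint32_t expected_index, std::uint32_t* out) {
    assert(out != nullptr);
    if (ch.index.load(std::memory_order_acquire) != expected_index) {
        return false;
    }
    std::memcpy(out, ch.buffer, sizeof ch.buffer);
    return true;
}
```

One thread would call `publish` and another would poll `try_consume`. The sketch assumes a single producer and that the buffer is not rewritten while a consumer is still reading it.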
Shared-memory synchronization is notoriously tricky, and sometimes the logical error is outside of the code that you are looking at. (I have made mistakes in the "consumer" code that I did not notice because I was focusing all my attention on the "producer" code. I have made mistakes in the initialization code, too, but those are usually catastrophic, rather than subtle.)
In this case, using the REP MOVS instruction (with its potential to weaken the memory ordering) adds another potential problem area.
- It might be interesting to try disabling the "fast string operations" in the IA32_MISC_ENABLE MSR.
- It might also be interesting to replace the REP MOVS with a loop of "ordinary" load and store operations.
- If the unexpected behavior is only occasional, then adding explicit MFENCE instructions can help identify exactly where the ordering is not matching your expectations.
- "Visibility" always refers to what other cores see.
- Section 8.2.2 says that "writes by a single processor are observed in the same order by all processors".
- What the local core sees from its own operations (ignoring any operations on memory updated by other cores) is just the serial ordering of the results of its own instruction stream.
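The two debugging suggestions above (an ordinary store loop instead of REP MOVS, plus an explicit fence) might look roughly like this in C++. It is a sketch under assumptions, not a drop-in fix: `Slot`, `publish_with_fence`, and `read_if_ready` are hypothetical names, `volatile` is used only to discourage the compiler from collapsing the loop into memcpy or a string instruction (it provides no inter-thread ordering by itself), and the seq_cst fence is typically emitted as mfence or a locked read-modify-write on x86.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCount = 18;  // same 0x12 dword count as the question

struct Slot {
    volatile std::uint32_t data[kCount] = {};   // volatile: one store per element
    std::atomic<std::uint32_t> ready_index{0};
};

// Producer: ordinary stores, an explicit full fence, then the index store.
inline void publish_with_fence(Slot& slot, const std::uint32_t* src, std::uint32_t new_index) {
    assert(src != nullptr);
    for (std::size_t i = 0; i < kCount; ++i) {
        slot.data[i] = src[i];  // ordinary store, no string instruction implied
    }
    // Full fence separating the data stores from the index store.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    slot.ready_index.store(new_index, std::memory_order_relaxed);
}

// Consumer: reads the data only after observing the expected index.
inline bool read_if_ready(Slot& slot, std::uint32_t expected_index, std::uint32_t* out) {
    assert(out != nullptr);
    if (slot.ready_index.load(std::memory_order_acquire) != expected_index) {
        return false;
    }
    for (std::size_t i = 0; i < kCount; ++i) {
        out[i] = slot.data[i];
    }
    return true;
}
```

If the failure disappears with this variant but returns with the string copy, that points at the copy path; if it persists, the problem is more likely elsewhere in the algorithm.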
Your second example is incorrect if the streaming stores (if they are being used) become visible after the non-streaming store of the mov on line 4. In that case the other threads may see the "ready" value placed into [ebx+0xdc] before the rep movs... completes. To correct this, remove line 5 (lock or) and place the lock on the store on line 4 (in practice a locked store such as xchg, since the lock prefix is not valid on a plain mov).
Note, re McCalpin: ...made mistakes in the "consumer" code...
If your consumer code waits until [ebx+0xdc] becomes the value contained in eax, then that location must hold something other than that value (or any other "go" value) beforehand, and all the other threads must observe that earlier value.
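A minimal sketch of that consumer-side requirement, assuming a sentinel that can never be a valid index. `kNotReady`, `Flag`, and `wait_for_index` are hypothetical names, and the bounded spin count exists only so the example terminates; a real consumer would choose its own waiting policy.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed sentinel: a value the producer never publishes as a real index.
constexpr std::uint32_t kNotReady = 0xFFFFFFFFu;

struct Flag {
    // Initialized to the sentinel before any consumer starts polling,
    // so "ready" is always a visible change of value.
    std::atomic<std::uint32_t> index{kNotReady};
};

// Polls up to max_spins times. Returns true and stores the observed index
// in `seen` once the location holds something other than the sentinel.
inline bool wait_for_index(const Flag& flag, unsigned max_spins, std::uint32_t& seen) {
    assert(max_spins > 0);
    for (unsigned i = 0; i < max_spins; ++i) {
        const std::uint32_t value = flag.index.load(std::memory_order_acquire);
        if (value != kNotReady) {
            seen = value;
            return true;
        }
    }
    return false;
}
```

If the location already held the "go" value from an earlier run or from initialization, the consumer would proceed before the producer had copied anything, which is the kind of consumer-side mistake described above.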
Thank you, John and Jim, for your comments.
John, you have cleared up my understanding of some of the comments in the software developer’s manual and confirmed what I had hoped. I now use only method one, using the xchg instruction to indicate that the block of data has been copied by storing the new index in [ebx+0xdc].
Thanks, Jim, for pointing out that other processors may see the new index before the data copy is complete in method two. Method one was my preferred method, and I only attempted method two to check my understanding. However, in my example it did seem to fix my problem. This just shows that when developing shared-memory synchronization algorithms one should not rely on testing alone, as it can lead to the wrong conclusions.
Using the Relacy race detector quickly showed there was a flaw in the surrounding algorithm which was easily fixed. Since making the corrections to the algorithm and using method one I have not seen a single failure in weeks of continuous testing.
Thanks again for all your help.
>> However, in my example it did seem to fix my problem.
I am glad you discovered the common fallacy of "seems to work". Dmitry's Relacy Race Detector is a good verification tool.