Implicit/cheap memory barriers

ranis · ‎06-09-2011

Hi,

Per the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (System Programming Guide),
section 8.2.2, Memory Ordering in P6 and More Recent Processor Families:

A) Reads are not reordered with other reads.
B) Writes are not reordered with older reads.
C) Writes to memory are not reordered with other writes.
D) Reads may be reordered with older writes to different locations but not with older writes to the same location.

Based on the above, will the following act as a full memory barrier
(i.e. prevent reordering)?
push eax    #1 store
pop eax     #2 load

Seems like based on the above rules:
1) #1 and #2 - no reorder per (D)
2) old/new reads - no reorder with #2 per (A)
3) old writes - no reorder with #1 per (C)
4) new writes - no reorder with #1 and #2 per (C) and (B)

Can the above act like the 'mfence' instruction? Is it as expensive?

Thanks,
Rani

jimdempseyatthecove · ‎06-10-2011

Rani,

Most programmers program in a higher-level language such as C/C++. These compilers have optimizations which can and do reorder statements as well as eliminate some memory stores and loads. What you have here are two barrier issues: is the compiler doing what you ask it to (with respect to the memory store), and is the memory store technique correct not only for your current hardware thread but for all threads, potentially on separate processors?

There are some multi-threaded coding sequences where you must ensure that a store is (or was) observable by the other threads, including threads on separate processors. For these coding situations not only is order important but "finality" is important. In other words, your thread may have ordered writes pending when your thread's subsequent code requires the writes to have completed, so that the other threads' view is consistent with your thread's presumptions.

This does not mean you must use mfence, as your code sequence may not require it for proper functioning (with other threads). An example of this is the single-producer single-consumer queue and the single-producer multiple-consumer queue. The producer side of the code need only concern itself with proper order and not finality (with the possible exception of the buffer-full condition transitioning to buffer-not-full).
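To make the producer-side point concrete, here is a minimal sketch of a single-producer single-consumer ring buffer in C++11 atomics. The class name `SpscQueue` and its members are made up for illustration, and this is only one possible layout under those assumptions. The producer writes the payload and then publishes the new head index with a release store; on x86 a release store typically compiles to an ordinary mov, so no mfence is involved.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical single-producer/single-consumer ring buffer (illustrative names).
// One slot is kept empty to tell "full" from "empty", so it holds N - 1 items.
template <typename T, std::size_t N>
class SpscQueue {
public:
    // Called only by the producer thread.
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                               // buffer full
        buffer_[head] = value;                          // 1) write the payload
        head_.store(next, std::memory_order_release);   // 2) publish, ordered after 1)
        return true;
    }

    // Called only by the consumer thread.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                               // buffer empty
        out = buffer_[tail];                            // read the payload
        tail_.store((tail + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }

private:
    T buffer_[N];
    std::atomic<std::size_t> head_{0};  // next slot to write (producer-owned)
    std::atomic<std::size_t> tail_{0};  // next slot to read (consumer-owned)
};
```

Only ordering is requested here: the consumer may see the new head a little late, but it never sees the index without the payload.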

Jim Dempsey

ranis · ‎06-10-2011

Hi Jim,

Thanks for the insightful information about high level MT concerns.

Say that I'm a compiler writer or the author of a low-level synchronization facility: will the above push/pop eax act as an mfence/full barrier on x86/x64 (at least for data, as code might be using non-coherent memory)?

FWIW, for example, the Linux spinlock release is implemented on x86 by a single non-locked write, since that provides the appropriate release semantics (i.e. no reordering with old/new writes or with old reads).
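As a rough illustration of that release pattern (this is not the Linux code; the class `SpinLock` and its member names are assumptions for the sketch), a spinlock in C++11 atomics can acquire with an atomic exchange and release with a plain release store:

```cpp
#include <atomic>

// Hypothetical minimal spinlock, for illustration only.
class SpinLock {
public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        // Atomic exchange with acquire semantics (a locked xchg on x86).
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!try_lock()) {
            // Spin on a plain load until the lock looks free, then retry.
            while (locked_.load(std::memory_order_relaxed)) { }
        }
    }

    void unlock() {
        // Release store: typically an ordinary mov on x86, with no LOCK
        // prefix and no mfence. Earlier reads and writes stay before it.
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};
```

The asymmetry is the point: taking the lock needs an atomic read-modify-write, while releasing it only needs the store to stay behind the critical section's accesses.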

I see two concerns here:

1) Should the compiler beware of generating such implicit fences/barriers in case they are expensive (e.g. as expensive as mfence)?

2) Can the compiler exploit such implicit barriers in case they are much cheaper than fences/locked instructions?
Can a synchronization facility benefit from the same?

For example, say that the compiler wants to provide an efficient barrier command which guarantees no compiler/CPU reordering (e.g. MemoryBarrier() on Windows):

A = 1;          // store
MemBarrier();   // avoid A-store B-load reorder using dummy (re)read of A
if (B) {        // load
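For comparison, the store-then-load pattern can be written with a standard full fence instead of the dummy re-read. This is a sketch only: `std::atomic_thread_fence` stands in for the hypothetical MemBarrier() above, and the function name `store_then_check` is made up. For a sequentially consistent fence, x86 compilers typically emit mfence or a LOCK-prefixed instruction.

```cpp
#include <atomic>

std::atomic<int> A{0};
std::atomic<int> B{0};

// Publishes A, then reports whether B was observed as non-zero.
bool store_then_check() {
    A.store(1, std::memory_order_relaxed);                // store
    // Full barrier: keeps the store to A ahead of the load of B,
    // for both the compiler and the CPU.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return B.load(std::memory_order_relaxed) != 0;        // load
}
```

If a second thread runs the mirror image (store B, fence, load A), at least one of the two threads is guaranteed to see the other's store.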

Thanks,
Rani

jimdempseyatthecove · ‎06-10-2011

Rani,

When MemoryBarrier and _mm_mfence are defined to use an xchg of uninitialized data (junk) with eax, this does two things:

1) xchg with a memory operand implicitly performs a LOCK (as if the LOCK prefix were on the instruction). The LOCK assures that the external view of memory is consistent with your core's internal view of the (cached) memory.

2) The function is not PURE (and contains __asm), and therefore the compiler optimizations will assume anything could have been (or could be) trashed. This means that variables the compiler has registerized and modified may get written back by compiler-generated code (occurring before the function call), and any registers that formerly held registerized variables will have to be refreshed from memory (at least from the coherent cache system).
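The two effects can be requested separately in standard C++11. The sketch below is an assumption-laden illustration (the variable `shared_value` and both function names are made up): `std::atomic_signal_fence` constrains only the compiler, while `std::atomic_thread_fence` constrains the compiler and the CPU.

```cpp
#include <atomic>

int shared_value = 0;  // plain variable the compiler might otherwise keep in a register

// Compiler-only barrier: asks the compiler not to reorder memory accesses
// across this point. No fence instruction is emitted for it.
int compiler_barrier_only(int v) {
    shared_value = v;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    return shared_value;
}

// Full barrier: the same compiler constraint, plus a hardware fence
// (typically mfence or a LOCK-prefixed instruction on x86).
int full_barrier(int v) {
    shared_value = v;
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return shared_value;
}
```

The first variant corresponds roughly to point 2 alone, the second to points 1 and 2 together.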

Jim

ranis · ‎06-10-2011

I see your point, but I actually wanted to get a kind of official x86/x64 answer about whether store-A followed by load-A acts as a memory barrier and whether it's as expensive as mfence/locked xchg. I might have to settle for private experimenting to figure that out...

Thanks,
Rani

ranis · ‎06-10-2011

The Intel guide proves me *wrong*, since the ordering rules are not as
straightforward as I thought.

8.2.3.5 Intra-Processor Forwarding Is Allowed

Processor 0          Processor 1
mov [_x], 1          mov [_y], 1
mov r1, [_x]         mov r3, [_y]
mov r2, [_y]         mov r4, [_x]

Initially x = y = 0
r2 = 0 and r4 = 0 is allowed

The memory-ordering model imposes no constraints on the order in which the two stores appear to execute by the two processors. This fact allows processor 0 to see its store before seeing processor 1's, while processor 1 sees its store before seeing processor 0's. (Each processor is self consistent.)

This allows r2 = 0 and r4 = 0.

In practice, the reordering in this example can arise as a result of store-buffer forwarding. While a store is temporarily held in a processor's store buffer, it can satisfy the processor's own loads but is not visible to (and cannot satisfy) loads by other processors.
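The two-processor sequence above can be tried out with a small litmus test. This is a sketch under assumptions: the function name `count_both_zero` and its parameters are made up, relaxed C++ atomics stand in for the plain mov instructions, and it needs a toolchain with thread support. It counts how often r2 = 0 and r4 = 0 shows up, optionally with a full fence after each store.

```cpp
#include <atomic>
#include <thread>

// Runs the two-processor sequence `iterations` times and returns how often
// r2 == 0 and r4 == 0 was observed. With use_fence == true a full fence is
// placed after each store.
int count_both_zero(int iterations, bool use_fence) {
    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

        std::thread p0([&] {
            x.store(1, std::memory_order_relaxed);       // mov [_x], 1
            if (use_fence)
                std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = x.load(std::memory_order_relaxed);      // mov r1, [_x]
            r2 = y.load(std::memory_order_relaxed);      // mov r2, [_y]
        });
        std::thread p1([&] {
            y.store(1, std::memory_order_relaxed);       // mov [_y], 1
            if (use_fence)
                std::atomic_thread_fence(std::memory_order_seq_cst);
            r3 = y.load(std::memory_order_relaxed);      // mov r3, [_y]
            r4 = x.load(std::memory_order_relaxed);      // mov r4, [_x]
        });
        p0.join();
        p1.join();

        if (r2 == 0 && r4 == 0)
            ++hits;  // each thread saw only its own store
    }
    return hits;
}
```

With the fences the both-zero outcome is ruled out, so `count_both_zero(n, true)` returns 0. Without them the outcome is allowed but timing-dependent; because thread start-up dominates each iteration, a short run may well report zero hits too.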

Now I need to figure out again why the non-locked write works for spinlock
release...

Rani

jimdempseyatthecove · ‎06-10-2011

Release is fine.

Consider this: you hold the lock (1 or -1 or some other non-zero value in the SpinLock variable).
You overstrike the non-zero value (whatever it is) with 0, thus releasing the lock.

At the time you do the overstrike, the other thread(s) potentially looking at the lock could be context-switched out, or could be busy spinning while waiting for the variable to change. In the release case a "lazy write" is acceptable. What is required (and does happen) is that if you release lock A and then release lock B, the observable order is maintained, or at least the A and B releases appear to occur simultaneously (just never observed as B then A).
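The ordering requirement for two releases can be sketched in C++11 atomics as follows. The names `lock_a`, `lock_b`, `release_a_then_b` and `order_respected` are made up for illustration, and release/acquire operations stand in for the plain x86 stores and loads.

```cpp
#include <atomic>

std::atomic<int> lock_a{1};  // non-zero means held
std::atomic<int> lock_b{1};

// Releasing thread: overstrike A with 0, then B with 0.
void release_a_then_b() {
    lock_a.store(0, std::memory_order_release);
    lock_b.store(0, std::memory_order_release);
}

// Observing thread: reads B first, then A. Returns false only if B appears
// released while A still appears held, i.e. the order "B then A" was seen.
bool order_respected() {
    const int b = lock_b.load(std::memory_order_acquire);
    const int a = lock_a.load(std::memory_order_acquire);
    return !(b == 0 && a != 0);
}
```

An observer may see both locks held, A released alone, or both released; it should never get `false` back from `order_respected()`, however lazily the writes drain.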

A disadvantage in "lazy write" is it increases the latency that the other threads may have in observing (and reacting) to the release. However, if no other thread is waiting for the release then this will not matter.

A second potential disadvantage arises if you do a back-to-back unlock then lock of the same SpinLock. The "lazy write" technique may be biased towards your own thread.

Here is a non-Intel site with instruction latencies

http://www.freeweb.hu/instlatx64/

Jim Dempsey