Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

LOCK vs MFENCE

Dmitry_Vyukov
Valued Contributor I
I've measured the latency of the 'lock cmpxchg' and 'mfence' instructions on a Pentium 4 processor. I got the following results:

lock cmpxchg - 100 cycles
mfence - 104 cycles

So I conclude that they are nearly identical with respect to consumed cycles.

But is there any difference between them with respect to system performance? Especially on modern multicore processors (Core 2 Duo, Core 2 Quad)?

Is the following assumption correct: the lock prefix involves bus/cache locking, so it has an impact on total system performance, while mfence has only a local impact on the current core?

Or, more practically: if I have 2 algorithms, one using the lock prefix and the other using mfence, other things being equal, which should I prefer?

Thanks in advance
Dmitriy V'jukov
levicki
Valued Contributor I

Those two instructions do completely different things. You cannot use mfence instead of the lock prefix.

Description:

Performs a serializing operation on all load and store instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction).

Weakly ordered memory types can enable higher performance through such techniques as out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring ordering between routines that produce weakly-ordered results and routines that consume this data.

It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). The PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the MFENCE instruction is not ordered with respect to PREFETCHh or any of the speculative fetching mechanisms (that is, data could be speculatively loaded into the cache just before, during, or after the execution of an MFENCE instruction).
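
To illustrate the ordering guarantee described above, here is a minimal sketch (plain C with GCC inline assembly on x86 assumed; the program is illustrative, not taken from the manual) of the classic store-load case. Without the two fences, each thread's load may be satisfied before its own store becomes globally visible, so both r0 and r1 can end up 0:

#include <stdio.h>
#include <pthread.h>

volatile int flag0 = 0, flag1 = 0;
int r0, r1;

void *thread0(void *arg)
{
    flag0 = 1;                                    /* store */
    __asm__ __volatile__("mfence" ::: "memory");  /* order the store before the load */
    r0 = flag1;                                   /* load */
    return NULL;
}

void *thread1(void *arg)
{
    flag1 = 1;
    __asm__ __volatile__("mfence" ::: "memory");
    r1 = flag0;
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* with the fences in place, r0 == 0 && r1 == 0 is impossible */
    printf("r0=%d r1=%d\n", r0, r1);
    return 0;
}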

Dmitry_Vyukov
Valued Contributor I
IgorLevicki:

Those two instructions do completely different things. You cannot use mfence instead of the lock prefix.


I know.

That much can be learned from the basic documentation. What can't be learned from the basic documentation is their impact on system performance.

Dmitriy V'jukov
levicki
Valued Contributor I

I do not know what you are trying to do, but I can tell you this: the most I ever needed was the SFENCE instruction, when I was using non-temporal stores to copy data.

That said, I haven't noticed any performance degradation from SFENCE. If there was any, it was offset by the faster transfer speed that came from using non-temporal stores. Bear in mind, though, that those stores are meant to be used only for data which won't be immediately reused, and that store buffers are a scarce resource in some CPUs.

Finally, whatever their impact may be, they are needed for coherence, so whether you should use them is not in question if your code needs them.
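
For reference, a minimal sketch (SSE2 intrinsics assumed, with 16-byte-aligned buffers and a size that is a multiple of 16) of the non-temporal copy pattern described above, with SFENCE making the streaming stores globally visible before the function returns:

#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* copy 'bytes' from 16-byte-aligned src to 16-byte-aligned dst,
   bypassing the cache with non-temporal (streaming) stores */
static void nt_copy(void *dst, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);  /* ordinary aligned load */
        _mm_stream_si128(&d[i], v);         /* non-temporal store    */
    }
    _mm_sfence();  /* drain the write-combining buffers before returning */
}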

Dmitry_Vyukov
Valued Contributor I
IgorLevicki:

I do not know what you are trying to do...



Consider, for example, the following situation.
A program uses a fairly large number of mutexes, and every particular mutex synchronizes only 2 threads.
I can implement the mutex with:
1. The "traditional" scheme, based on "lock xchg" in the acquire operation and a "naked store" in the release operation.
2. The Peterson algorithm, based on a #StoreLoad memory barrier (mfence) in the acquire operation and a "naked store" in the release operation.

So the net difference is LOCK vs MFENCE (a rough sketch of both acquire/release paths follows below).
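
To make the comparison concrete, here is a minimal sketch of the two schemes (GCC inline assembly on x86 assumed; the names are illustrative only, not from any particular implementation):

/* sketch 1: "traditional" mutex - lock xchg acquire, naked store release */
static volatile int lock_word = 0;

static void spin_acquire(void)
{
    int busy;
    do {
        busy = 1;
        /* xchg with a memory operand implies the LOCK prefix */
        __asm__ __volatile__("xchg %0, %1"
                             : "+r"(busy), "+m"(lock_word)
                             :
                             : "memory");
    } while (busy);
}

static void spin_release(void)
{
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier only */
    lock_word = 0;                          /* naked store */
}

/* sketch 2: Peterson lock for exactly 2 threads - mfence (#StoreLoad) acquire */
static volatile int interested[2] = {0, 0};
static volatile int victim = 0;

static void peterson_acquire(int self)      /* self is 0 or 1 */
{
    int other = 1 - self;
    interested[self] = 1;
    victim = self;
    /* order the stores above before the loads in the spin loop */
    __asm__ __volatile__("mfence" ::: "memory");
    while (interested[other] && victim == self)
        ;                                   /* spin */
}

static void peterson_release(int self)
{
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier only */
    interested[self] = 0;                   /* naked store */
}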

The question is: will there be any difference in system performance on a quad-core machine?

Dmitriy V'jukov
Anat_S_Intel
Employee

lock has a similar effect to mfence, so in that respect they should have the same performance.

The Peterson algorithm that I found on the internet has 3 synchronization variables (one for the loser and two for the interested parties) that the threads share. The traditional algorithm has only one synchronization variable. Therefore the Peterson algorithm has more potential for long-latency modified-data sharing.

Dmitry_Vyukov
Valued Contributor I
anshgm:

lock has a similar effect to mfence, so in that respect they should have the same performance.


SHOULD HAVE or HAVE?


anshgm:

The Peterson algorithm that I found on the internet has 3 synchronization variables (one for the loser and two for the interested parties) that the threads share. The traditional algorithm has only one synchronization variable. Therefore the Peterson algorithm has more potential for long-latency modified-data sharing.


IMHO the number of shared variables doesn't play a significant role. What matters is the number of shared cache lines, and the number of heavy operations (lock, mfence).
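
For example, a rough sketch (GCC attribute syntax and a 64-byte line assumed; the struct is illustrative, not taken from any existing code) of keeping all three Peterson variables inside a single cache line, padded so that the line is not shared with unrelated data; the cost of bouncing that line between the two cores is then the same as for a single-variable lock:

#define CACHE_LINE 64

/* all three synchronization variables packed into one cache line;
   the trailing padding and the alignment keep unrelated data off
   that line, so only one line ever migrates between the two cores */
struct peterson_lock {
    volatile int interested[2];
    volatile int victim;
    char pad[CACHE_LINE - 3 * sizeof(int)];
} __attribute__((aligned(CACHE_LINE)));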

Dmitriy V'jukov
Chris_M__Thomasson
New Contributor I

Here is an implementation I did a while back which makes use of MFENCE:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317

The fact that it has 3 variables doesn't mean that much because, as Dmitriy points out, they all fit within a single L2 cache line. The only advantage I can see is that the algorithm contains no interlocked RMW instructions, which should be easier on the FSB.

addyvarma
Beginner
Hi

I was trying to implement a flavor of barrier wait to synchronize pthreads (the machine is a Xeon X5, running Linux 2.6 and gcc 4.1.1).
To ensure sequential consistency I sprinkled my code with mfences; unfortunately, I find memory consistency being broken when I run this code. I was wondering if I've missed something.

I'm using 3 booleans as the synchronization flags and have spread them across 3 cache lines to prevent false sharing. I am synchronizing 3 threads (taking 3 as a small example).

uint8_t X[3 * cacheline_size]; (where the cache line size is 64 in this case)

My memory barrier is "__asm__ __volatile__ ("mfence" : : : "memory");"
(pardon the overkill; I've used full memory barriers instead of lfences/sfences, just to make sure this works before I optimize it)

ThreadA
-------
while (1) {
    mem barrier #1;
    while (X[0 * cacheline_size] != 0) ;   // spin; I tried inserting pause too ------------> POINT A
    mem barrier #2;
    do some work;

    mem barrier #3;
    X[0 * cacheline_size] = 1;
    mem barrier #4;

    while (X[0 * cacheline_size] != 2) ;   // spin
    mem barrier #5;
    do something;

    mem barrier #6;
    X[0 * cacheline_size] = 3;             // ------------> POINT C
    mem barrier #7;
}
............
(similar code in the other threads)

Main synchronizing thread
------------------------------------
while (1) {
    mem barrier #8;
    while (not all X[0...n] == 1) ;        // spin until (X[0 * cacheline_size] == 1) && (X[1 * cacheline_size] == 1) etc.
    mem barrier #9;

    X[0...n] = 2;                          // tried doing this with a CAS too - didn't help
    mem barrier #10;

    while (X[0...n] != 3) ;                // spin ------------> POINT B
    mem barrier #11;

    do something;
    mem barrier #12;

    X[0...n] = 0;
    mem barrier #13;
}

Problem/Issue

I get into a deadlock at times: thread A is waiting at point A (spinning while X[0] != 0) and the main thread is waiting at point B with X[0] == 2. There's no way that could have happened unless thread A bypassed point C.

Could someone tell me if I've screwed up someplace?

Thanks
-Addy

Dmitry_Vyukov
Valued Contributor I
Try to declare X as volatile.
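
For instance, something like this (using the 64-byte line size from your post; the constant name is just illustrative):

#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* volatile so the compiler re-reads and re-writes the flags on every
   access instead of caching them in registers */
volatile uint8_t X[3 * CACHE_LINE_SIZE];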

addyvarma
Beginner
Sorry, I forgot to mention that: I did declare the synchronizing flag as volatile. Beats me why this piece of code is still non-deterministic. To be doubly sure that the compiler itself hasn't reordered the code, I made a non-optimized build (with gcc using -O0).

-Addy




Quoting - Dmitriy Vyukov
Try to declare X as volatile.


Dmitry_Vyukov
Valued Contributor I
Please post the full code in a new forum thread.

Dmitry_Vyukov
Valued Contributor I
Quoting - addyvarma
My memory barrier is "__asm__ __volatile__ ("mfence" : : : "memory");"
(pardon the overkill; I've used full memory barriers instead of lfences/sfences, just to make sure this works before I optimize it)


You probably misunderstand the semantics of the x86 fences.
SFENCE is of any use ONLY if you use non-temporal store instructions (e.g. MOVNTI).
And LFENCE is completely useless; it's basically a no-op.

MFENCE is of any use ONLY if you are trying to order a critical store-load sequence. As far as I can see, there are no critical store-load sequences in your example, so you need no hardware fences on x86 at all. Just declare the variables as volatile so that the compiler preserves program order.
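
As an illustration of that point, a rough sketch (plain C on x86 assumed; the names are only illustrative) of a flag hand-off that needs no hardware fence at all, because it relies only on store-store and load-load ordering, which x86 already guarantees; volatile is there only to keep the compiler from reordering or caching the accesses:

volatile int data = 0;
volatile int ready = 0;

void producer(void)
{
    data = 42;   /* store the data                                */
    ready = 1;   /* store the flag: x86 does not reorder store-store */
}

int consumer(void)
{
    while (!ready)
        ;        /* spin on the flag: load                        */
    return data; /* load the data: x86 does not reorder load-load */
}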

addyvarma
Beginner
Sure, I'll put it in another post.

But here's one question, before I move it to another post.

I did use just volatiles before I used the fences, and because just using volatiles didn't help, I had to resort to fences (Adve does talk of the sequential consistency issues for busy-wait synchronization).

So I guess we are missing something really basic here.

Thanks.


Quoting - Dmitriy Vyukov
Quoting - addyvarma
My memory barrier is "__asm__ __volatile__ ("mfence" : : : "memory");"
(pardon the overkill; I've used full memory barriers instead of lfences/sfences, just to make sure this works before I optimize it)


You probably misunderstand the semantics of the x86 fences.
SFENCE is of any use ONLY if you use non-temporal store instructions (e.g. MOVNTI).
And LFENCE is completely useless; it's basically a no-op.

MFENCE is of any use ONLY if you are trying to order a critical store-load sequence. As far as I can see, there are no critical store-load sequences in your example, so you need no hardware fences on x86 at all. Just declare the variables as volatile so that the compiler preserves program order.


Dmitry_Vyukov
Valued Contributor I
Quoting - addyvarma
I did use just volatiles before I used the fences, and because just using volatiles didn't help, I had to resort to fences (Adve does talk of the sequential consistency issues for busy-wait synchronization).

So I guess we are missing something really basic here.


Since your program doesn't work even with the fences, the problem is probably somewhere else; it's hard to say without seeing the code.
