Re: mfence and/or lock in multi-core systems

shiningram · 02-27-2009

Hi,
I went through the documentation for MFENCE:

"Performs a serializing operation on all load and store instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction)."

But a few things are not clear to me. I am adding some sample code to understand the practical use of MFENCE.

1. How many of the load and store instructions issued prior to the MFENCE instruction does it serialize?
2. If I use MFENCE, do I still need to use the LOCK prefix?

Here is an example which increments a 64-bit counter on a 32-bit system.

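again:                  ; retry label (assumed here; the snippet as posted omits it)
mov ecx, edx            ; ecx:ebx = copy of the expected value in edx:eax
mov ebx, eax
add ebx, 1              ; 64-bit increment: low dword first,
adc ecx, 0              ; then propagate the carry into the high dword
mfence
lock cmpxchg8b [edi]    ; if [edi] == edx:eax, store ecx:ebx; otherwise load [edi] into edx:eax
mfence                  ; MFENCE does not modify the flags, so ZF survives
jnz again               ; ZF clear: the counter changed in the meantime, retry
mov eax, dummy
mov [eax], ebx          ; save the new 64-bit value to dummy
mov [eax+4], ecx
pop ebx
pop edi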

I also went through this post:
http://software.intel.com/en-us/forums/showthread.php?t=56040

3. What is the correct use of MFENCE and/or LOCK on multi-core CPU systems in the above example?

Thanks,
Ram Regar

Dmitry_Vyukov · 02-28-2009

Quoting - shiningram

1. How many of the load and store instructions issued prior to the MFENCE instruction does it serialize?

MFENCE serializes ALL previous memory accesses with ALL subsequent memory accesses.
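
To make this concrete, here is a minimal C11 sketch of the one case where a full fence matters on x86: a store followed by a load from a different location. The names `want`, `announce_and_check` and `withdraw` are made up for the illustration, and the instruction the compiler emits for the fence (MFENCE or a LOCKed instruction) depends on the compiler, so treat the comments as typical behaviour rather than a guarantee.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flags for a two-thread, Dekker-style handshake. */
static atomic_int want[2];

/* Thread `me` (0 or 1) announces its intent, then checks the other thread.
 * Returns true if the other thread has not announced intent. */
bool announce_and_check(int me)
{
    /* Relaxed store: the fence below is what provides the ordering. */
    atomic_store_explicit(&want[me], 1, memory_order_relaxed);

    /* Full fence. Without it, x86 may let the load below complete
     * before the store above leaves the store buffer. Compilers
     * typically emit MFENCE or a LOCKed instruction here. */
    atomic_thread_fence(memory_order_seq_cst);

    return atomic_load_explicit(&want[1 - me], memory_order_relaxed) == 0;
}

/* Thread `me` withdraws its announcement. */
void withdraw(int me)
{
    atomic_store_explicit(&want[me], 0, memory_order_release);
}
```

With the fence in place, two threads calling `announce_and_check(0)` and `announce_and_check(1)` concurrently cannot both get `true`.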

Dmitry_Vyukov · 02-28-2009

Quoting - shiningram

2. If I use MFENCE, do I still need to use the LOCK prefix?
3. What is the correct use of MFENCE and/or LOCK on multi-core CPU systems in the above example?

A LOCKed instruction acts as a full memory fence with respect to both the preceding and the subsequent memory accesses. So just remove all the MFENCEs from the code.
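
For comparison, the same 64-bit increment can be sketched in C11 with a compare-and-swap loop and no explicit fence. This is only an illustration, not a translation of the posted assembly: `increment64` is a made-up name, and the claim that a 32-bit x86 compiler turns the compare-exchange into LOCK CMPXCHG8B is the typical case, not something the C standard promises.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically increments *counter and returns the new value. */
uint64_t increment64(_Atomic uint64_t *counter)
{
    uint64_t expected = atomic_load_explicit(counter, memory_order_relaxed);
    uint64_t desired;

    do {
        desired = expected + 1;
        /* On failure, `expected` is refreshed with the current value,
         * much like CMPXCHG8B reloading edx:eax, so the loop retries
         * with up-to-date data. The default ordering is sequentially
         * consistent; no separate fence is written around the CAS. */
    } while (!atomic_compare_exchange_weak(counter, &expected, desired));

    return desired;
}
```

The carry from the low to the high 32 bits is handled by the ordinary 64-bit addition, so a counter at 0xFFFFFFFF becomes 0x100000000 after one call.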

Dmitry_Vyukov · 03-01-2009

You've probably already seen this on c.p.t but I will post it here too for completeness.
LOCK may not synchronize non-temporal stores and WC memory; this is architecture dependent:

> ------------------------------
> For the P6 family processors, locked operations serialize all
> outstanding load and store operations (that is, wait for them to
> complete). This rule is also true for the Pentium 4 and Intel Xeon
> processors, with one exception. Load operations that reference weakly
> ordered memory types (such as the WC memory type) may not be
> serialized.
> ------------------------------

To synchronize non-temporal stores and WC memory, you have to issue SFENCE (not MFENCE) before the LOCKed instruction.
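
A rough sketch of that pattern in C, assuming an SSE2-capable x86 target: a non-temporal store, then SFENCE, then a LOCKed read-modify-write that publishes the data. The names `payload`, `published`, `publish` and `consume` are invented for the example, and the non-x86 branch is only there so the sketch compiles elsewhere.

```c
#include <stdatomic.h>
#ifdef __SSE2__
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

static int payload;            /* hypothetical data written non-temporally */
static atomic_uint published;  /* hypothetical publication counter */

void publish(int value)
{
#ifdef __SSE2__
    _mm_stream_si32(&payload, value);  /* non-temporal store (MOVNTI) */
    _mm_sfence();                      /* SFENCE: order the NT store before what follows */
#else
    payload = value;                   /* fallback: ordinary store */
#endif
    /* Read-modify-write; on x86 this is typically a LOCKed ADD or XADD. */
    atomic_fetch_add(&published, 1);
}

/* Returns 1 and copies the payload if something has been published. */
int consume(int *out)
{
    if (atomic_load(&published) == 0)
        return 0;
    *out = payload;
    return 1;
}
```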

shiningram · 03-02-2009

Quoting - Dmitriy Vyukov

MFENCE serializes ALL previous memory accesses with ALL subsequent memory accesses.

1. What is the scope of serialization?
{
load1
load2
store1
load3
store2

mfence

store3
load4
store4
load5

}

In the above case, will the load and store serialization be scoped by the braces? Can you please explain the meaning of ALL here?
My understanding is that processor 1, on executing MFENCE, signals all other processors to finish all their store/load operations and waits before it can start executing cmpxchg8b atomically. It is like acquiring and releasing a lock. I am trying to understand how MFENCE works.

I learnt that cmpxchg8b implicitly has "lock" in it. Does that mean neither LOCK nor MFENCE is required at all when using cmpxchg8b?

Thanks,
Ram Regar

Dmitry_Vyukov · 03-02-2009

Quoting - shiningram

In the above case, will the load and store serialization be scoped by the braces? Can you please explain the meaning of ALL here?
My understanding is that processor 1, on executing MFENCE, signals all other processors to finish all their store/load operations and waits before it can start executing cmpxchg8b atomically. It is like acquiring and releasing a lock. I am trying to understand how MFENCE works.

I learnt that cmpxchg8b implicitly has "lock" in it. Does that mean neither LOCK nor MFENCE is required at all when using cmpxchg8b?

All memory accesses that precede the MFENCE in program order are serialized with all memory accesses that follow it in program order. Braces or any other source-level scope play no role.
Modern processors lock only the target cache line, so if it is already cached in the core in the M (Modified) state, then NO global inter-core/processor interaction occurs. And I believe MFENCE is always local, i.e. NO global inter-core/processor interaction occurs. Global inter-core/processor ordering is handled by the cache-coherence protocol.
XCHG with a memory operand has an implicit LOCK; CMPXCHG and CMPXCHG8B do not, so they need an explicit LOCK prefix.
When you are using LOCK CMPXCHG, MFENCE is not required (if someone uses non-temporal stores, then it's better to assume that it's HIS responsibility to serialize them).
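
As an illustration of the XCHG point, here is a small test-and-set spinlock sketch in C11. The type and function names (`demo_spinlock`, `spin_try_lock`, `spin_lock`, `spin_unlock`) are made up, and the instruction choices in the comments describe what x86 compilers usually generate, which is an assumption about the toolchain, not a rule.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock: 0 = free, 1 = held. */
typedef struct { atomic_int held; } demo_spinlock;

/* Tries to take the lock once; returns true on success. */
bool spin_try_lock(demo_spinlock *l)
{
    /* atomic_exchange usually becomes XCHG with a memory operand on x86,
     * which is implicitly LOCKed, so no LOCK prefix or MFENCE is written. */
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

/* Spins until the lock is taken. */
void spin_lock(demo_spinlock *l)
{
    while (!spin_try_lock(l)) {
        /* busy-wait */
    }
}

/* Releases the lock; on x86 a release store is usually a plain MOV. */
void spin_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

A compare-exchange version of `spin_try_lock` would instead compile to CMPXCHG with an explicit LOCK prefix emitted by the compiler.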