Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

why only acquire in xchg?

xt2
Beginner

Why does xchg have only acquire semantics?

I found an old post asking this but did not get an answer. I am an application developer and recently learned some memory-barrier concepts.

I ask this question because I found that Windows uses xchg to implement its memory barrier:

http://windowssdk.msdn.microsoft.com/en-us/library/ms684208.aspx

But shouldn't a memory barrier have both acquire and release semantics?

Can anyone shed light on this? Am I misunderstanding something obvious?

Thanks,

jimdempseyatthecove
Honored Contributor III

The "void MemoryBarrier(void);" function is analogous to OpenMP FLUSH.

From the MS link:

>>
This macro is called when the order of memory reads and writes are critical for program operation. It is often used with multithread synchronization functions such as InterlockedExchange.

This macro can be called on all processor platforms where Windows is supported, but it has no effect on some platforms. The definition varies from platform to platform. The following are some definitions of this macro in Winnt.h:


#define MemoryBarrier _mm_mfence

FORCEINLINE
VOID
MemoryBarrier (
    VOID
    )
{
    LONG Barrier;
    __asm {
        xchg Barrier, eax
    }
}

#define MemoryBarrier __mf
<<

The above alludes to the macro being used in InterlockedExchange, but I believe they should have said the technique is used.

I do not have a copy of InterlockedExchange here, but I venture to guess it looks something like:

LONG InterlockedExchange(
    LONG volatile* Target,
    LONG Value
    )
{
    __asm {
        mov eax, dword ptr [Value]  ; eax = Value
        mov edx, dword ptr [Target] ; edx = the Target pointer
        xchg dword ptr [edx], eax   ; atomic swap; xchg with a memory
                                    ; operand asserts LOCK# implicitly
    }                               ; the old *Target returns in eax
}

InterlockedExchange can be used for acquiring and releasing things like a spin lock or an OpenMP critical section.

The MemoryBarrier function (macro) is used when you are not using an interlocked-type function but wish to ensure pending writes (from processor/cache to RAM) are flushed.

An example of this is a queue system where you have a source (creator of nodes) and a sink (consumer of nodes / worker thread). Perhaps, due to the way the application is written, it inserts a node and then increments a count, and these operations are performed without interlocked function calls. In this situation the insertion of the node logically occurs before the increment of the count, but due to the possibility of out-of-order writes the count might be observed before the insertion. Inserting the MemoryBarrier between the node insertion and the count increment ensures the insertion completes before the increment, as in the sketch below.
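A minimal sketch of that producer side, assuming a single producer and a simple singly linked list; Node, Head, Count, and ProducerInsert are illustrative names, not Windows APIs:

#include <windows.h>

/* Hypothetical queue node, for illustration only. */
typedef struct Node { struct Node* next; int payload; } Node;

static Node* volatile Head  = NULL;
static volatile LONG  Count = 0;

/* Producer: insert a node, then publish the new count. Without the
   barrier, the consumer could observe the incremented Count before
   the insertion of the node is visible. */
void ProducerInsert(Node* n)
{
    n->next = Head;
    Head = n;          /* 1. insert the node                     */
    MemoryBarrier();   /* 2. drain pending writes before publish */
    Count = Count + 1; /* 3. the consumer polls Count            */
}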

If you were to call InterlockedExchange to acquire a resource then, depending on how your code is written, you might not need to call InterlockedExchange to release the resource. Example:

volatile LONG ResourceFlag = 0; // 0=Free, 1=InUse
...
// Wait until we get the resource
while (InterlockedExchange(&ResourceFlag, 1) != 0) {
    Sleep(0); // Wait a tad
}
// Have the resource
... // do your thing
// Release with a plain store; IA-32 writes become visible in program
// order, so the work above is observed before the flag clears
ResourceFlag = 0;

Jim Dempsey

xt2
Beginner

Thanks a lot, it really helps.

Actually my confusion comes from C#'s ECMA CLI spec,

http://www.ecma-international.org/publications/standards/Ecma-335.htm, Partition I, section 12.6.7, which talks about the memory model.

it says: "A volatile read has acquire semantics meaning that the read is guaranteed to occur prior to any references to memory that occur after the read instruction in the CIL instruction sequence. A volatile write has release semantics meaning that the write is guaranteed to happen after any memory references prior to the write instruction in the CIL instruction sequence."

I also found that both System.Threading.Thread.VolatileRead and VolatileWrite are implemented with MemoryBarrier():

==============

.method public hidebysig static int32 VolatileRead(int32& address) cil managed noinlining
{
  // Code size       10 (0xa)
  .maxstack  1
  .locals init (int32 V_0)
  IL_0000:  ldarg.0
  IL_0001:  ldind.i4
  IL_0002:  stloc.0
  IL_0003:  call       void System.Threading.Thread::MemoryBarrier()
  IL_0008:  ldloc.0
  IL_0009:  ret
} // end of method Thread::VolatileRead

===================

.method public hidebysig static void VolatileWrite(uint32& address,
                                                   uint32 'value') cil managed noinlining
{
  .custom instance void System.CLSCompliantAttribute::.ctor(bool) = ( 01 00 00 00 00 )
  // Code size       9 (0x9)
  .maxstack  8
  IL_0000:  call       void System.Threading.Thread::MemoryBarrier()
  IL_0005:  ldarg.0
  IL_0006:  ldarg.1
  IL_0007:  stind.i4
  IL_0008:  ret
} // end of method Thread::VolatileWrite

The above can be found with the ildasm tool that accompanies Visual Studio 2005.

If MemoryBarrier() is implemented using xchg, and xchg ONLY has "acquire semantics" as Intel's documentation says, how can it be used to implement both VolatileRead and VolatileWrite (the latter needs "release semantics")?

After reading more on acquire/release, I think "acquire" means the current CPU reads its thread's working memory (registers, cache) from main memory or from another CPU's cache, so that it always gets fresh data; that way, "acquire" prevents instructions after it from being moved before it. "Release" works the opposite way: it flushes the current CPU's working memory to main memory, or tells the other CPUs to refetch the data from main memory or from this CPU's cache, so "release" prevents instructions before it from being moved after it. And "acquire + release" means a full memory fence. Is this understanding valid?
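For instance, this is the pattern I have in mind; a sketch only, with Data, Ready, Publisher, and Consumer as my own illustrative names:

#include <windows.h>

int Data = 0;
volatile LONG Ready = 0;

void Publisher(void)  /* needs release: Data must be visible before Ready */
{
    Data = 42;
    MemoryBarrier();  /* a full fence includes the release half */
    Ready = 1;
}

int Consumer(void)    /* needs acquire: Ready must be read before Data */
{
    while (Ready == 0)
        Sleep(0);     /* wait for the flag */
    MemoryBarrier();  /* a full fence includes the acquire half */
    return Data;
}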

Thanks,

Xin

jimdempseyatthecove
Honored Contributor III

Your thoughts are correct, but I believe the actual sequences depend on the hardware implementation. Not only are CPUs involved; memory-mapped I/O devices may be involved as well.

The volatile read should, to the extent possible, extract the most recent value of a given memory location (or port) from all potential modifiers of the value for that location. Advanced hardware implementations permit one processor to peek at (read from) another processor's cache, and some implementations permit one processor to poke (write to) another's cache. Earlier generations of processors used a slower method that forced cache flushes to main memory and invalidations from the other processors.

The barrier for read-ahead is relatively interesting even on a single-processor system. Potentially, a system could have an I/O device such as an instrument with a memory-mapped FPGA. You could write to shared location A, the FPGA could respond by modifying location B, and your application later reads location B. Locations A and B are both viewed by the processor as ordinary memory addresses, so although the assembly code performs the write of A prior to the read of B, the processor may elect to read B first if it assumes there is no dependency on the write of A. In this case (an FPGA coprocessor) the read of B is dependent on the write of A; therefore you need the read barrier. A sketch of the idea follows.
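A minimal sketch of that scenario, assuming the barrier also orders the read; the register addresses and the QueryFpga protocol are made up for illustration:

#include <windows.h>

/* Hypothetical memory-mapped FPGA registers; addresses are invented. */
volatile LONG* const LocationA = (volatile LONG*)0xF0000000; /* command  */
volatile LONG* const LocationB = (volatile LONG*)0xF0000004; /* response */

LONG QueryFpga(LONG command)
{
    *LocationA = command; /* write A: the FPGA reacts by updating B      */
    MemoryBarrier();      /* keep the read of B from passing the write   */
    return *LocationB;    /* read B: its value depends on the write of A */
}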

The processor engineers will give you a better (and longer) description of this.

Jim Dempsey
