Re: atomic add of doubles

sngan · ‎11-12-2009

does anyone has a perf. implementation of an atomic add for double values
by using f.e. a union with double/long and interlockedcompareexchange on top ???
or what will be the best way to do this atomic + fast ?

jimdempseyatthecove · ‎11-12-2009

Most current systems now support an 8 byte compare and swap. For those systems you can loop on performing the double add to temp, store temp in memory, load temp to register(s), do a compare and swap, if fail repeat loop. If there is a chance that your code will run on a system without 8 byte compare and swap then you should have a test for it (e.g. use CPUID). The recommended way then is to perform the atomic double add in two functions, one for systems with 8 byte compare and swap and one for systems without. Then use a functor (function pointer) to point to the correct function for the system (do this in your initialization code). The function for systems without 8 byte compare and swap would have to use a different technique such as a critical section.

Most of this work has already been done for you (e.g. IVF has internal support for this and so do most C++ implementations with OpenMP)

Because interlockedcompareexchange is expensive, it behooves you to reduce the number of times you execute the atomic add (sub, ...) on doubles. Use thread stack temporaries as long as you can, then do the atomic add once. Typically this is done for you if you use the reduction clauses within OpenMP.

Jim Dempsey

sngan · ‎11-12-2009

Quoting - jimdempseyatthecove

Most current systems now support an 8 byte compare and swap. For those systems you can loop on performing the double add to temp, store temp in memory, load temp to register(s), do a compare and swap, if fail repeat loop. If there is a chance that your code will run on a system without 8 byte compare and swap then you should have a test for it (e.g. use CPUID). The recommended way then is to perform the atomic double add in two functions, one for systems with 8 byte compare and swap and one for systems without. Then use a functor (function pointer) to point to the correct function for the system (do this in your initialization code). The function for systems without 8 byte compare and swap would have to use a different technique such as a critical section.

Most of this work has already been done for you (e.g. IVF has internal support for this and so do most C++ implementations with OpenMP)

Because interlockedcompareexchange is expensive, it behooves you to reduce the number of times you execute the atomic add (sub, ...) on doubles. Use thread stack temporaries as long as you can, then do the atomic add once. Typically this is done for you if you use the reduction clauses within OpenMP.

Jim Dempsey

thanks, openmp and things like that cannot be used and synced updates will only used when absolutely necessary.
i need some handwritten lines of code with unions of floating point/int. does some one ever tried this and has some perf. numbers ? compiler intrinsics or inline asm ?

jimdempseyatthecove · ‎11-12-2009

This is what I use on Windows platform with ICC in QuickThread (www.quickthreadprogramming.com)

[cpp]#if defined(_WIN64)
#define USE_InterlockedCompareExchange64 InterlockedCompareExchange64
#else
__int64 USE_InterlockedCompareExchange64(volatile LONGLONG* m, __int64 iNew, __int64 iOld)
{
	__int64 iReturn;
	_asm {
		push esi
		mov esi, m
		mov eax,dword ptr iOld
		mov edx,dword ptr iOld+4
		mov ebx,dword ptr iNew
		mov ecx,dword ptr iNew+4
		LOCK CMPXCHG8B [esi]
		pop esi
		mov dword ptr iReturn, eax
		mov dword ptr iReturn+4, edx
	};
	return iReturn;
}
#endif
double AtomicAdd(double* pd, double d)
{
	union
	{
		volatile LONGLONG iOld;
		double dOld;
	};
	union
	{
		LONGLONG iNew;
		double dNew;
	};
    for(;;)
	{
		iOld = *(__int64*)pd;	// current old value
		dNew = dOld + d;
		if(USE_InterlockedCompareExchange64( (volatile LONGLONG*)pd, iNew, iOld) == iOld)
			return dNew;
	}
}
[/cpp]

Jim Dempsey

jimdempseyatthecove · ‎11-12-2009

It should be obvious that the above requires a processor with CMPXCHG8B when running in 32-bit mode.
As stated in earlier post, if you plan to run on earlier processors then you will have to test for support of this instruction. Instead of test each time, or test once and set flag (then testflagandbranch each time), store address of proper function in location, then callindirect that location.

You can run your own timing.

Jim Dempsey

sngan · ‎11-13-2009

thanks a lot, i will give it a try. hopefully it is cheaper than: crit. section start, double add, crit. section end.

jimdempseyatthecove · ‎11-13-2009

Please note, if you are on x64 and not on Windows and thus do not have access to its intrinsic you can use the "lock cmpxchg" on 64-bit arguments. (Linux should have an equivilent, although the arguments may be reversed). Be aware that x64 uses Fast Call (args passed in registers). So that ASM code section will have to be rewritten accordingly.

Jim

Dmitry_Vyukov · ‎11-13-2009

Quoting - jimdempseyatthecove

Please note, if you are on x64 and not on Windows and thus do not have access to its intrinsic you can use the "lock cmpxchg" on 64-bit arguments. (Linux should have an equivilent, although the arguments may be reversed). Be aware that x64 uses Fast Call (args passed in registers). So that ASM code section will have to be rewritten accordingly.

GCC has a set of built-in atomic functions:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html#Atomic-Builtins
(don't forget to supply -march=i686 switch to compiler)

Dmitry_Vyukov · ‎11-13-2009

Quoting - Dmitriy Vyukov

GCC has a set of built-in atomic functions:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html#Atomic-Builtins
(don't forget to supply -march=i686 switch to compiler)

Also you may find implementation of atomic RMW operations for a lot of platforms for gcc in atomic_ops project:
http://www.hpl.hp.com/research/linux/atomic_ops/

Chris_M__Thomasson · ‎11-13-2009

Quoting - jimdempseyatthecove

This is what I use on Windows platform with ICC in QuickThread (www.quickthreadprogramming.com)

[cpp][...]
    for(;;)
	{
		iOld = *(__int64*)pd;	// current old value
		dNew = dOld + d;
		if(USE_InterlockedCompareExchange64( (volatile LONGLONG*)pd, iNew, iOld) == iOld)
			return dNew;
	}
}[/cpp]

I am wondering why you are not taking advantage of the return value of `InterlockedCompareExchange64()'? You are performing redundant loads every time the CAS fails. I usually do something like:

[cpp]void atomic_add(LONGLONG volatile* psrc,
                LONGLONG addend)
{
    LONGLONG cmp1;
    LONGLONG cmp2 = *psrc;

    do
    {
        cmp1 = cmp2;

        cmp2 = InterlockedCompareExchange64(
                   psrc,
                   cmp1 + addend,
                   cmp1
               );
    }
    
    while (cmp1 != cmp2);
}[/cpp]

BTW, QuickThreads looks very interesting indeed. I hope you make a lot of $$$!

;^)

jimdempseyatthecove · ‎11-13-2009

Chris,

When the CAS (DCAS for x32) fails then the value of the double was altered by a different thread.
This value is a double, not an __int64. therefore I cannot perform "cmp1 + addend" in the argument position within the InterlockedCompareExchange. I have to store into memory, load into FPU or MMX or SSE, do new add, save, load back into __int64 or eax/edx as the case may be, then try again.

A lot of hoop jumping but that is what is required.

Perhaps one day we can perform a direct register to register move between XMM0 and eax or rax (both ways)
Or better yet,

LOCK CMPXCHGXMMb mem_128_ptr, comparand, swap, mask
LOCK CMPXCHGXMMw mem_128_ptr, comparand, swap, mask

Where you have a pointer to 128-bit cache line contained memory location, a 128-bit candidate for comparrison, a 128-bit candidate for swap, an indicator as to if the mask is byte, word, dword, ... anda maskindicating whichdata perform the compare and swap (elements not flagged are unchanged). Restricting the comparrison to binary values as opposed to float/double would be and acceptible compramise

Then have AVX format with 256 bits (byte mask 32 bits)

The only requirement being the memory area must fit within a cache line.

Thiswould beinteresting from a parallel programming perspective since the instruction could be expanded tonot only perform == test but also include <, <=, >, >=, !=

Intel would have to do some simultion studies to see if this provides a good return on the effort necessary to expand the instruction set.

RE: $$$

Haven't put the "Buy it Now" up yet.
Will be doing this soon. (Working out some kinks in the distribution files).

Jim Dempsey

Chris_M__Thomasson · ‎11-13-2009

Quoting - jimdempseyatthecove

Chris,

When the CAS (DCAS for x32) fails then the value of the double was altered by a different thread.
This value is a double, not an __int64. therefore I cannot perform "cmp1 + addend" in the argument position within the InterlockedCompareExchange. I have to store into memory, load into FPU or MMX or SSE, do new add, save, load back into __int64 or eax/edx as the case may be, then try again.

A lot of hoop jumping but that is what is required.

[...]

RE: $$$

Haven't put the "Buy it Now" up yet.
Will be doing this soon. (Working out some kinks in the distribution files).

Humm... Please excuse my ignorance, but why wouldn't the following code work on a 32-bit system:

[cpp]__declspec(naked) 
static int
DWCAS(void volatile* const pself,
      void* pcmp,
      void const* pxhcg)
{
    _asm 
    {
        PUSH ESI
        PUSH EBX
        MOV ESI, [ESP + 16]
        MOV EAX, [ESI]
        MOV EDX, [ESI + 4]
        MOV ESI, [ESP + 20]
        MOV EBX, [ESI]
        MOV ECX, [ESI + 4]
        MOV ESI, [ESP + 12]
        LOCK CMPXCHG8B QWORD PTR [ESI]
        JNE DWCAS_failed
        MOV EAX, 1
        POP EBX
        POP ESI
        RET

    DWCAS_failed:
        MOV ESI, [ESP + 16]
        MOV [ESI], EAX
        MOV [ESI + 4], EDX
        MOV EAX, 0
        POP EBX
        POP ESI
        RET
    }
}


typedef char sassert
[
    sizeof(double) == sizeof(unsigned __int64) ? 1 : -1
];


static
double
atomic_add_double(double volatile* src,
                  double addend)
{
    double xchg;
    double cmp = *src;

    do
    {
        xchg = cmp + addend;
    }

    while (! DWCAS(src, &cmp, &xchg));

    return xchg;
}[/cpp]

What am I missing?

Chris_M__Thomasson · ‎11-13-2009

Quoting - jimdempseyatthecove

Chris,

When the CAS (DCAS for x32) fails then the value of the double was altered by a different thread.
This value is a double, not an __int64. therefore I cannot perform "cmp1 + addend" in the argument position within the InterlockedCompareExchange. I have to store into memory, load into FPU or MMX or SSE, do new add, save, load back into __int64 or eax/edx as the case may be, then try again.

A lot of hoop jumping but that is what is required.

[...]

Yes. I know that when `InterlockedCompareExchange64()' fails it means another thread mutated the value. However, the return value of the CAS is the updated value that the other thread wrote. So, there is no need to reload from the shared location because the failed CAS already did it for us.

Just to clarify for a moment, the first pseudo-code I posted was just an example of how to takeadvantageof the return value of `InterlockedCompareExchange64()'. The second code I posted shows how this can be used in the atomic addition of doubles. The `DWCAS()' function automatically updates the comparand on failure, which is prettyconvenient.; IMVHO of course.

;^)

jimdempseyatthecove · ‎11-13-2009

The original post wanted to perform an atomic add of a double.

Although the DWCAS returned the modified value into a register (or pair of registers) the data is not directly usable since it represents a double. I cannot add to the double in registers. This would require me to write the returned value to memory prior to performing the add of double. Since it is in memory already, I can simply perform the add of the double that is in memory (saving one write to memory).

Jim Dempsey

Chris_M__Thomasson · ‎11-13-2009

Quoting - jimdempseyatthecove

The original post wanted to perform an atomic add of a double.

Although the DWCAS returned the modified value into a register (or pair of registers) the data is not directly usable since it represents a double. I cannot add to the double in registers. This would require me to write the returned value to memory prior to performing the add of double. Since it is in memory already, I can simply perform the add of the double that is in memory (saving one write to memory).

Okay. Thanks for your patience Jim. I don't know why I was being do dense on this matter!

;^o

jimdempseyatthecove · ‎11-14-2009

No problem Chris.

Any time you write code like this you must step through all the paths to assure that the code you want is the code you get. Pay special attention to code the the compiler might optimize out. I have observed a few cases where ICC squashed out a pointer dereference of a volatile. The "bug" got caught because the variable that got squashed out was a loop control variable and the loop hung. Had the error caused a false positive then race condition would have been hidden.

Also, if you are developing a library of these routines, and then developing applications that use this library, then I caution you to turn off IPO (interprocedural optimizations) .OR. do development of the library on a different system from the development of the application using the library. IPO is too aggressive and if any source file that went into the library is on your system, even if it is not in your solution or project, it will get IPO'd with the rest of the code. Handling multi-threaded race conditions is bad enough when the sensitive code sits in one spot, but when it resides in multiple places or worse gets optimized in a manner not condusive to theMT code you will have problems. These are particularly nasty because back in the library development context you are looking at different binary code.

Jim Dempsey

Chris_M__Thomasson · ‎11-15-2009

Quoting - jimdempseyatthecove

No problem Chris.

Any time you write code like this you must step through all the paths to assure that the code you want is the code you get. Pay special attention to code the the compiler might optimize out. I have observed a few cases where ICC squashed out a pointer dereference of a volatile. The "bug" got caught because the variable that got squashed out was a loop control variable and the loop hung. Had the error caused a false positive then race condition would have been hidden.

Also, if you are developing a library of these routines, and then developing applications that use this library, then I caution you to turn off IPO (interprocedural optimizations) .OR. do development of the library on a different system from the development of the application using the library. IPO is too aggressive and if any source file that went into the library is on your system, even if it is not in your solution or project, it will get IPO'd with the rest of the code. Handling multi-threaded race conditions is bad enough when the sensitive code sits in one spot, but when it resides in multiple places or worse gets optimized in a manner not condusive to theMT code you will have problems. These are particularly nasty because back in the library development context you are looking at different binary code.

Jim Dempsey

GreatadviseSir! I have learned to turn link time optmizations off as a habit. Anyway, WRT double's (e.g. this conversation), Iapologizeto everybody for making acompletemassive fooloutof myself. Thank you for not flaming me intooblivion!

;^(...

jimdempseyatthecove · ‎11-15-2009

RE:

What am I missing?

A potential problem with

static
double
atomic_add_double(double volatile* src,
double addend)
{
double xchg;
double cmp = *src;

do
{
xchg = cmp + addend;
}

while (! DWCAS(src, &cmp, &xchg));

return xchg;
}

The compiler may (will likely with full optimizations) registerize "cmp" meaning the old value of *srcwill be used in "cmp + addend"after failure of DWCAS. The DWCAS on fail will modify the stack location reserved for cmp but not the registered cmp. To correct for this you might need to use "volatile double cmp = *src;". The fact that one compiler with one set of optimizations produces the correct code is no assurance that all compilers nor all levels of optimizations produce the correct code.

Granted, the compiler should see the &cmp and make the inference that cmp may change as a result of the call and thus produce the correct code. I am suggesting that you verify this.

Now in any event, should the compiler generate the correct code then your code may be slower since you write a copy of DWCAS caputured src to memory, then on return fetch from memory (L1 cache) to produce the add. In the code I posted I omit the write of the copy and simply fetch from memory (L1, L2, L3 or memory) of the updated value.

First, the code should work always. Second, if you have a high degree of DWCAS failures then you really should spend the time to rework the code to reduce the requirement for DWCAS as this is very expensive. So as to if the failure path is slightly faster one way or the other this is immaterial.

Jim Dempsey

Dmitry_Vyukov · ‎11-15-2009

Quoting - jimdempseyatthecove

The compiler may (will likely with full optimizations) registerize "cmp" meaning the old value of *srcwill be used in "cmp + addend"after failure of DWCAS. The DWCAS on fail will modify the stack location reserved for cmp but not the registered cmp. To correct for this...

Doesn't such optimization also break functions like sscanf()? I can't imagine how any sane compiler may ever do something like that...

jimdempseyatthecove · ‎11-15-2009

Dmitriy,

The problem(s) does(do) not break the function call. Rather it breaks (at infrequent times) simple control logic that includes the function call. Usually I've seen

while(!*pointerToVolatile)
{
someFunct();
}

fail to obtain a fresh copy of the memory location at pointerToVolatile

Where someFunct could be sscanf or whatever and is not affected.

Jim

Dmitry_Vyukov · ‎11-15-2009

Quoting - jimdempseyatthecove

Dmitriy,

The problem(s) does(do) not break the function call. Rather it breaks (at infrequent times) simple control logic that includes the function call. Usually I've seen

while(!*pointerToVolatile)
{
someFunct();
}

fail to obtain a fresh copy of the memory location at pointerToVolatile

Where someFunct could be sscanf or whatever and is not affected.

On what compilers did you obeserve such problems? On 'modern' compilers (gcc 4.x, MSVC2005/2008) or on something older?