Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Atomic floats, is this implementation safe?

robert_jay_gould
Beginner
1,777 Views

I made an atomic float. It's probably not blazingly fast (that's OK), but it's faster than wrapping a lock around a float, and it works. I'm just not sure whether that's down to good luck or whether it's actually thread-safe. I think it is... but you never know, so I came to ask the experts :)

struct AtomicFloat : public tbb::atomic<size_t>
{
    float compare_and_swap(float value, float compare)
    {
        size_t value_ = tbb::atomic<size_t>::compare_and_swap((size_t&)value, (size_t&)compare);
        return reinterpret_cast<float&>(value_);
    }

    float operator=(float value)
    {
        size_t value_ = (*this).tbb::atomic<size_t>::operator=((size_t&)value);
        return reinterpret_cast<float&>(value_);
    }

    operator float()
    {
        return reinterpret_cast<float&>(*this);
    }

    float operator+=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<float&>(*this);
            new_value_ = old_value_ + value;
        } while (compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_);
    }
};

Also, as a caveat, I'm adding a static assertion that sizeof(float) == sizeof(size_t).
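[Editor's note] That size caveat can be made a compile-time check, and the float/integer reinterpretation can be done without going through the FPU. A minimal sketch, assuming C++11 (`float_bits`/`bits_float` are made-up helper names, and `std::memcpy` stands in for the reference casts above):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Compile-time guarantee that a float's bits fit the integer cell we CAS on.
static_assert(sizeof(float) == sizeof(std::uint32_t),
              "float must be exactly 32 bits for this trick to work");

// Bit-level copies: the value never passes through the FPU.
inline std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float bits_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

A fixed-size `memcpy` like this compiles down to a plain register move, so it costs nothing over the casts while avoiding their aliasing problems.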

Thanks!

0 Kudos
53 Replies
RafSchietekat
Valued Contributor III

I dare not look too closely, but would you pick my lottery numbers next week?

How is this used? Has anybody else missed atomic floats? What operations should be supported?

robert_jay_gould
Beginner
Quoting - Raf Schietekat

I dare not look too closely, but would you pick my lottery numbers next week?

How is this used? Has anybody else missed atomic floats? What operations should be supported?

Sure I'll pick your lottery numbers :)

There you go 3-7-12-23-24.

But on a more serious note, I want/plan to support all generic tbb::atomic operations with the same interface, if this turns out to be a viable (thread-safe) solution.

And their intended use is as data-members on objects.

class ThreadSafeActor
{
    atomic<float> fingers;
    atomic<float> thirst;
    atomic<float> hunger;
};

So you see, I'd rather have atomic floats than locks on the object or on each data member. And so far it amazingly works while running 10 threads concurrently on the same object. Nothing has broken so far...

----------

Edit: Performance-wise, using a locked float for 5,000,000 += operations with 100 threads takes 3.6s on my machine, while my atomic float, even with its silly do-while, takes 0.2s to do the same work. So the >30x performance boost means it's worth it, (and this is the catch) if it's correct.
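[Editor's note] The correctness side of that comparison (not the timing) can be sanity-checked with a self-contained CAS-loop `+=` under contention, using C++11 `std::atomic` in place of tbb::atomic; the names here are illustrative:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// CAS-loop "atomic +=" on a float stored as raw bits in a 32-bit atomic cell.
void atomic_add(std::atomic<std::uint32_t>& cell, float addend) {
    std::uint32_t old_bits = cell.load();
    for (;;) {
        float old_val;
        std::memcpy(&old_val, &old_bits, sizeof old_val);
        float new_val = old_val + addend;
        std::uint32_t new_bits;
        std::memcpy(&new_bits, &new_val, sizeof new_bits);
        // On failure, compare_exchange_weak refreshes old_bits with the cell's
        // current contents, so the loop retries against fresh data.
        if (cell.compare_exchange_weak(old_bits, new_bits)) break;
    }
}

float run_threads(int n_threads, int iters) {
    std::atomic<std::uint32_t> cell(0);   // the bit pattern of 0.0f is all zeros
    std::vector<std::thread> pool;
    for (int t = 0; t < n_threads; ++t)
        pool.emplace_back([&] { for (int i = 0; i < iters; ++i) atomic_add(cell, 1.0f); });
    for (auto& th : pool) th.join();
    float result;
    std::uint32_t bits = cell.load();
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

If no increments are lost, `run_threads(n, k)` ends at exactly `n * k`, which is lossless in float as long as the total stays below 2^24.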

robert_jay_gould
Beginner
Here is a full code listing that works and is about 30x faster than a locked float.
It would still be great, though, to have someone from Intel say whether this is kosher.


struct AtomicFloat : public tbb::atomic<uint_32>
{
    template<tbb::memory_semantics M>
    float fetch_and_add( float addend )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_add<M>((uint_32&)addend);
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_add( float addend )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_add((uint_32&)addend);
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float fetch_and_increment()
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_increment<M>();
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_increment()
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_increment();
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float fetch_and_decrement()
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_decrement<M>();
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_decrement()
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_decrement();
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float fetch_and_store( float value )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_store<M>((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_store( float value )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_store((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float compare_and_swap( float value, float comparand )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::compare_and_swap<M>((uint_32&)value, (uint_32&)comparand);
        return reinterpret_cast<const float&>(value_);
    }

    float compare_and_swap(float value, float compare)
    {
        const uint_32 value_ = tbb::atomic<uint_32>::compare_and_swap((uint_32&)value, (uint_32&)compare);
        return reinterpret_cast<const float&>(value_);
    }

    operator float() const volatile // volatile qualifier here for backwards compatibility
    {
        const uint_32 value_ = (*this);
        return reinterpret_cast<const float&>(value_);
    }

    float& _internal_reference() const
    {
        return reinterpret_cast<float&>(this->my_value);
    }

    float operator=(float value)
    {
        const uint_32 value_ = (*this).tbb::atomic<uint_32>::operator=((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    float operator+=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const volatile float&>(*this);
            new_value_ = old_value_ + value;
        } while (compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_);
    }

    float operator-=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const volatile float&>(*this);
            new_value_ = old_value_ - value;
        } while (compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_);
    }

    float operator++()
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const volatile float&>(*this);
            new_value_ = old_value_++;
        } while (compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_);
    }

    float operator--()
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const volatile float&>(*this);
            new_value_ = old_value_--;
        } while (compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_);
    }
};


RafSchietekat
Valued Contributor III
I would start by not inheriting from another atomic (maybe privately), using tbb::atomic_word instead (even if uint32_t is already a big improvement over size_t), rewriting the operations involving addition (oops!), thinking about what equality really means, ...

RafSchietekat
Valued Contributor III
Quoting - Raf Schietekat
I would start by not inheriting from another atomic (maybe privately), using tbb::atomic_word instead (even if uint32_t is already a big improvement over size_t), rewriting the operations involving addition (oops!), thinking about what equality really means, ...

I miss being able to edit later...

atomic_word is not in tbb, of course, which makes an assert look more attractive against the alternative of emulating or blindly trusting atomic_word.

robert_jay_gould
Beginner

Fixed some stuff and removed some stuff... especially the additions, oops! :)

struct AtomicFloat
{
    tbb::atomic<uint_32> atomic_value_;

    template<tbb::memory_semantics M>
    float fetch_and_store( float value )
    {
        const uint_32 value_ = atomic_value_.fetch_and_store<M>((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_store( float value )
    {
        const uint_32 value_ = atomic_value_.fetch_and_store((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float compare_and_swap( float value, float comparand )
    {
        const uint_32 value_ = atomic_value_.compare_and_swap<M>((uint_32&)value, (uint_32&)comparand);
        return reinterpret_cast<const float&>(value_);
    }

    float compare_and_swap(float value, float compare)
    {
        const uint_32 value_ = atomic_value_.compare_and_swap((uint_32&)value, (uint_32&)compare);
        return reinterpret_cast<const float&>(value_);
    }

    operator float() const volatile // volatile qualifier here for backwards compatibility
    {
        const uint_32 value_ = atomic_value_;
        return reinterpret_cast<const float&>(value_);
    }

    float operator=(float value)
    {
        const uint_32 value_ = atomic_value_.operator=((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    float operator+=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(atomic_value_);
            new_value_ = old_value_ + value;
        } while (compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_);
    }

    float operator*=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(atomic_value_);
            new_value_ = old_value_ * value;
        } while (compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_);
    }

    float operator/=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(atomic_value_);
            new_value_ = old_value_ / value;
        } while (compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_);
    }

    float operator-=(float value)
    {
        return this->operator+=(-value);
    }

    float operator++()
    {
        return this->operator+=(1);
    }

    float operator--()
    {
        return this->operator+=(-1);
    }

    float fetch_and_add( float addend )
    {
        return this->operator+=(addend);
    }

    float fetch_and_increment()
    {
        return this->operator+=(1);
    }

    float fetch_and_decrement()
    {
        return this->operator+=(-1);
    }
};


RafSchietekat
Valued Contributor III
I would omit the notions of increment and decrement (a normal float does not have them either), but you are still evaluating the outcome of compare_and_swap as if you didn't have to care about the binary representation of plus/minus zero.
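[Editor's note] The plus/minus-zero point is easy to demonstrate: the two IEEE 754 zeros compare equal as floats but differ in their bit patterns, so a bitwise CAS and a float comparison can disagree. A small illustration (helper names are made up):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Numeric equality vs bit equality for the two IEEE 754 zeros.
bool same_value(float a, float b) { return a == b; }

bool same_bits(float a, float b) {
    std::uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}
```

A CAS on raw bits treats `-0.0f` as a mismatch for `+0.0f`, even though `==` says they are the same value.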

jimdempseyatthecove
Honored Contributor III
Quoting - Raf Schietekat

I miss being able to edit later...

atomic_word is not in tbb, of course, which makes an assert look more attractive against the alternative of emulating or blindly trusting atomic_word.

The underlying compare and swap operates on a 32-bit value when used for float, and presumably later on a 64-bit value when used for double. The values used by the compare and swap pass through general-purpose registers, not through the FPU or SSE/MMX registers. Therefore, it may be advisable in your compare_and_swap not to obtain the old value (for use as comparand) by way of C++ statements that treat the copy operation as a copy of floats, which may use FPU, SSE/MMX or other instructions on different processors. Instead, I would suggest copying the old value, for use as comparand, using a statement that casts the variable as int32 or int64 as the case may be. Using the float cast may exhibit problems when the old value contains a denormalized number, a NaN, +/- infinity, -0, etc.
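[Editor's note] Jim's point about non-numbers can be seen directly with NaN: two identical NaN bit patterns compare unequal as floats, so a retry loop keyed on float equality would spin forever. A sketch, assuming IEEE 754 floats and no -ffast-math (helper names made up):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Float-level compare: NaN != NaN, even for the very same bit pattern,
// so a CAS retry loop keyed on float equality would never terminate.
bool float_equal(float a, float b) { return a == b; }

// Bit-level compare: identical patterns always match.
bool bits_equal(float a, float b) {
    std::uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}
```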

Jim Dempsey

RafSchietekat
Valued Contributor III
Quoting - Raf Schietekat
I would omit the notions of increment and decrement (a normal float does not have them either), but you are still evaluating the outcome of compare_and_swap as if you didn't have to care about the binary representation of plus/minus zero.

Sorry, I should have checked first: increment and decrement are defined on all arithmetic and pointer types, which includes floating point. Even a bool can be incremented (which results in the value true)... but not decremented; I'm happy that I didn't know this (although now I do).

robert_jay_gould
Beginner
Here is the latest iteration of the code, any comments?

#include <tbb/atomic.h>
typedef unsigned int uint_32;
typedef unsigned __TBB_LONG_LONG uint_64;

template<typename FLOATING_POINT, typename MEMORY_BLOCK>
struct atomic_float_
{
    /* CRC Card -----------------------------------------------------
    | Class: atomic float template class
    |
    | Responsibility: handle integral atomic memory as if it were a float,
    | but partially bypassing the FPU and SSE/MMX, so it is
    | slower than a true float, but faster and smaller
    | than a locked float.
    | *Warning* If your float usage is thwarted by
    | the A-B-A problem this class isn't for you.
    | *Warning* The atomic specification says we return
    | values, not l-values. So (i = j) = k doesn't work.
    |
    | Collaborators: Intel's tbb::atomic handles memory atomicity
    ----------------------------------------------------------------*/
    typedef atomic_float_<FLOATING_POINT, MEMORY_BLOCK> self_t;

    tbb::atomic<MEMORY_BLOCK> atomic_value_;

    template<tbb::memory_semantics M>
    FLOATING_POINT fetch_and_store( FLOATING_POINT value )
    {
        const MEMORY_BLOCK value_ =
            atomic_value_.template fetch_and_store<M>((MEMORY_BLOCK&)value);
        //atomic specification requires returning the old value, not the new one
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    FLOATING_POINT fetch_and_store( FLOATING_POINT value )
    {
        const MEMORY_BLOCK value_ =
            atomic_value_.fetch_and_store((MEMORY_BLOCK&)value);
        //atomic specification requires returning the old value, not the new one
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    template<tbb::memory_semantics M>
    FLOATING_POINT compare_and_swap( FLOATING_POINT value, FLOATING_POINT comparand )
    {
        const MEMORY_BLOCK value_ =
            atomic_value_.template compare_and_swap<M>((MEMORY_BLOCK&)value, (MEMORY_BLOCK&)comparand);
        //atomic specification requires returning the old value, not the new one
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    FLOATING_POINT compare_and_swap(FLOATING_POINT value, FLOATING_POINT compare)
    {
        const MEMORY_BLOCK value_ =
            atomic_value_.compare_and_swap((MEMORY_BLOCK&)value, (MEMORY_BLOCK&)compare);
        //atomic specification requires returning the old value, not the new one
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    operator FLOATING_POINT() const volatile // volatile qualifier here for backwards compatibility
    {
        const MEMORY_BLOCK value_ = atomic_value_;
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    //Note: the atomic specification says we return a copy of the base value, not an l-value
    FLOATING_POINT operator=(FLOATING_POINT rhs)
    {
        const MEMORY_BLOCK value_ = atomic_value_.operator=((MEMORY_BLOCK&)rhs);
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    //Note: the atomic specification says we return an l-value when operating among atomics
    self_t& operator=(self_t& rhs)
    {
        atomic_value_.operator=((MEMORY_BLOCK&)rhs);
        return *this;
    }

    FLOATING_POINT& _internal_reference() const
    {
        return reinterpret_cast<FLOATING_POINT&>(atomic_value_._internal_reference());
    }

    FLOATING_POINT operator+=(FLOATING_POINT value)
    {
        FLOATING_POINT old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const FLOATING_POINT&>(atomic_value_);
            new_value_ = old_value_ + value;
            //floating point binary representation is not an issue because
            //we are using our self's compare and swap, thus comparing floats with floats
        } while(self_t::compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_); //return the resulting value
    }

    FLOATING_POINT operator*=(FLOATING_POINT value)
    {
        FLOATING_POINT old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const FLOATING_POINT&>(atomic_value_);
            new_value_ = old_value_ * value;
            //floating point binary representation is not an issue because
            //we are using our self's compare and swap, thus comparing floats with floats
        } while(self_t::compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_); //return the resulting value
    }

    FLOATING_POINT operator/=(FLOATING_POINT value)
    {
        FLOATING_POINT old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const FLOATING_POINT&>(atomic_value_);
            new_value_ = old_value_ / value;
            //floating point binary representation is not an issue because
            //we are using our self's compare and swap, thus comparing floats with floats
        } while(self_t::compare_and_swap(new_value_, old_value_) != old_value_);
        return (new_value_); //return the resulting value
    }

    FLOATING_POINT operator-=(FLOATING_POINT value)
    {
        return this->operator+=(-value); //return the resulting value
    }

    //Prefix operator
    FLOATING_POINT operator++()
    {
        return this->operator+=(1); //return the resulting value
    }

    //Prefix operator
    FLOATING_POINT operator--()
    {
        return this->operator+=(-1); //return the resulting value
    }

    //Postfix operator
    FLOATING_POINT operator++(int)
    {
        const FLOATING_POINT temp = *this;
        this->operator+=(1);
        return temp; //return the old value
    }

    //Postfix operator
    FLOATING_POINT operator--(int)
    {
        const FLOATING_POINT temp = *this;
        this->operator+=(-1);
        return temp; //return the old value
    }

    FLOATING_POINT fetch_and_add( FLOATING_POINT addend )
    {
        const FLOATING_POINT old_value_ = *this;
        this->operator+=(addend);
        //atomic specification requires returning the old value, not the new one as in operator x=
        return old_value_;
    }

    FLOATING_POINT fetch_and_increment()
    {
        const FLOATING_POINT old_value_ = *this;
        this->operator+=(+1);
        //atomic specification requires returning the old value, not the new one as in operator x=
        return old_value_;
    }

    FLOATING_POINT fetch_and_decrement()
    {
        const FLOATING_POINT old_value_ = *this;
        this->operator+=(-1);
        //atomic specification requires returning the old value, not the new one as in operator x=
        return old_value_;
    }
};

typedef atomic_float_<float, uint_32> AtomicFloat;
typedef atomic_float_<double, uint_64> AtomicDouble;


jimdempseyatthecove
Honored Contributor III

Robert

old_value_ = reinterpret_cast<const FLOATING_POINT&>(atomic_value_);

new_value_ = old_value_ + value;
//floating point binary representation is not an issue because
//we are using our self's compare and swap, thus comparing floats with floats


If, in obtaining old_value_, the compiler generates FPU instructions to copy the data, AND atomic_value_ contains a denormalized number, then old_value_ will never == atomic_value_, and therefore the compare and swap will always fail. In the case of the denormalized number, new_value_ will compute to the value you desire but can never get stored, due to the binary difference between old_value_ and atomic_value_. SSE and MMX instructions may work (garbage in, garbage out). However, consider first-time use of atomic_value_ containing uninitialized data, which could potentially be one of several patterns that are not floating-point numbers. The safest way is to copy to old_value_ using a cast on both sides as uint32 (of your flavor). Then the bit pattern that was in atomic_value_ is now in old_value_, and if the number were "not a normal number", the attempted conversion would occur at the "+" and not at the first "="; and whatever the result, it would be storable back into atomic_value_.

The main problem is not with operator +=, since presumably, with good coding practice, it will occur only after at least one operator = that initializes atomic_value_ to a valid floating-point number (though NaNs and other nonsense can get stored too). If operator = uses compare_and_swap, you must match the bit pattern and not use numeric equivalence (which may have a different bit pattern).
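[Editor's note] A sketch of the `+=` loop rewritten along these lines, with the comparand carried as raw bits end to end (C++11 `std::atomic` standing in for tbb::atomic; `fetch_add_bits` is a made-up name):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// The comparand is the exact bit pattern read from the cell, so denormals,
// NaNs, and -0.0 cannot cause a spurious mismatch between "what we read"
// and "what is actually there". Floats appear only in the arithmetic step.
float fetch_add_bits(std::atomic<std::uint32_t>& cell, float addend) {
    std::uint32_t old_bits = cell.load();
    for (;;) {
        float old_val;
        std::memcpy(&old_val, &old_bits, sizeof old_val);  // bits -> float, just for the add
        float new_val = old_val + addend;
        std::uint32_t new_bits;
        std::memcpy(&new_bits, &new_val, sizeof new_bits);
        if (cell.compare_exchange_weak(old_bits, new_bits)) return new_val;
        // On failure old_bits now holds the cell's current bits; retry.
    }
}

// Tiny single-threaded exercise of the loop.
float demo_fetch_add() {
    std::atomic<std::uint32_t> cell(0);   // all-zero bits == 0.0f
    fetch_add_bits(cell, 2.5f);
    return fetch_add_bits(cell, 2.5f);    // 0.0 + 2.5 + 2.5
}
```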

Jim Dempsey

robert_jay_gould
Beginner

Robert

old_value_ = reinterpret_cast<const FLOATING_POINT&>(atomic_value_);

new_value_ = old_value_ + value;
//floating point binary representation is not an issue because
//we are using our self's compare and swap, thus comparing floats with floats


If, in obtaining old_value_, the compiler generates FPU instructions to copy the data, AND atomic_value_ contains a denormalized number, then old_value_ will never == atomic_value_, and therefore the compare and swap will always fail. In the case of the denormalized number, new_value_ will compute to the value you desire but can never get stored, due to the binary difference between old_value_ and atomic_value_. SSE and MMX instructions may work (garbage in, garbage out). However, consider first-time use of atomic_value_ containing uninitialized data, which could potentially be one of several patterns that are not floating-point numbers. The safest way is to copy to old_value_ using a cast on both sides as uint32 (of your flavor). Then the bit pattern that was in atomic_value_ is now in old_value_, and if the number were "not a normal number", the attempted conversion would occur at the "+" and not at the first "="; and whatever the result, it would be storable back into atomic_value_.

The main problem is not with operator +=, since presumably, with good coding practice, it will occur only after at least one operator = that initializes atomic_value_ to a valid floating-point number (though NaNs and other nonsense can get stored too). If operator = uses compare_and_swap, you must match the bit pattern and not use numeric equivalence (which may have a different bit pattern).

Jim Dempsey

Thanks for the in-depth explanation; I had misunderstood your point in your previous post.

But now I can see the possible danger there. Going back to fixing :)

RafSchietekat
Valued Contributor III

I'm still wondering whether this is an ad hoc exercise or an indication that I should add floating-point atomics to the "Additions to atomic" patch.

During your fixing, remember that just using bit patterns would be oblivious to the equality of plus zero and minus zero for compare_and_swap, but maybe compare_and_swap should be deprecated against a compare_and_store (see the patch: in/out comparand as first parameter, new value as second parameter, third parameter indicating whether spurious failure is disallowed, boolean success status returned). Although, it could also be another parameter: by default minus zero equals plus zero, unless the user decides to differentiate them.

robert_jay_gould
Beginner
Quoting - Raf Schietekat

I'm still wondering whether this is an ad hoc exercise or an indication that I should add floating-point atomics to the "Additions to atomic" patch.

During your fixing, remember that just using bit patterns would be oblivious to the equality of plus zero and minus zero for compare_and_swap, but maybe compare_and_swap should be deprecated against a compare_and_store (see the patch: in/out comparand as first parameter, new value as second parameter, third parameter indicating whether spurious failure is disallowed, boolean success status returned). Although, it could also be another parameter: by default minus zero equals plus zero, unless the user decides to differentiate them.

Well, I think atomic floats are a necessary feature, but then I'm the one who began the thread.

My line of thought here is that in theory, and in most textbook examples of threading, floats can be easily dismissed.

Also, when working with graphics, no one will want to use atomic floats.

But in my case, working on a multithreaded event-driven simulation (in this case a game server), being able to use atomic floats as object data members can seriously reduce the number of locks (and a lock is fairly big when compared with a float, and even with most objects). This improves memory usage (you can keep more objects hot), reduces thread convoying, and overall makes things snappier.

It also helps keep the code cleaner and removes lots of potential deadlocks that could happen if someone isn't careful. Now, if my simulation were non-event-driven, say just calculating some complex end state, I'd be better off using vectorization, just as with graphics. But alas, it's not.

The only downside here is that without the proper instruction set, atomic floats are kind of a hack (the A-B-A problem), but a useful tool if used appropriately.

Nevertheless, perhaps the real solution in the long run isn't atomics but transactional memory; since I'm working on a commercial product with deadlines, though, Intel's STM won't do me any good for now. And adoption of transactional memory, if it catches on, will probably happen several years down the road. So (pseudo-)atomic floats in the meantime would probably be welcomed by many.

RafSchietekat
Valued Contributor III

"a lock is are fairly big when compared with a float" I think you should test your assumptions again with tbb::spin_mutex for a lock, both about size (currently always one byte) and about performance (may be substantially better than what you've seen, especially if you keep the float and the lock in the same cache line and are careful about what else to allow there).

"removes lots of potential deadlocks" That's where scoped locks come in (assuming the thread remains active).

"A-B-A problem" Depending on your application...

"Intel's STM won't do me any good for now" What product would that be, and when is it expected?

P.S.: Actually I need either 5 out of 50 plus 2 out of 9 or 6 out of 42 for that lottery thing.

AJ13
New Contributor I

I just skimmed the thread, so maybe I missed this. What is a potential use case for an atomic float? I've thought about them myself, but I'd like to hear some ideas on how they can be useful. I did see your Actor class.

Dmitry_Vyukov
Valued Contributor I

But in my case, working on a multithreaded event-driven simulation (in this case a game server), being able to use atomic floats as object data members can seriously reduce the number of locks (and a lock is fairly big when compared with a float, and even with most objects). This improves memory usage (you can keep more objects hot), reduces thread convoying, and overall makes things snappier.

A mutex can be as small as 1 bit, i.e. you can pack up to 32 mutexes into a single 32-bit word.

You can see an interesting bit-mutex implementation here:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/bbf4e0b8d1db5b4f

The interesting thing about it is that it's possible to lock/unlock an arbitrary set of mutexes with a single atomic RMW.

Or you can pack a mutex and a {pointer, number, enum, etc} into a single word, i.e. no waste of space at all.
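[Editor's note] A minimal sketch of that packing idea, assuming C++11: one 32-bit word carries a lock bit plus a 31-bit payload (class and method names are made up):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// One 32-bit word holds a lock flag (bit 0) and a 31-bit payload (bits 1..31),
// in the spirit of Dmitry's suggestion: no space beyond the data itself.
class BitLockedCounter {
    std::atomic<std::uint32_t> word_{0};
public:
    void lock() {
        std::uint32_t cur = word_.load();
        for (;;) {
            cur &= ~1u;                                    // expect the lock bit clear
            if (word_.compare_exchange_weak(cur, cur | 1u)) return;
            // on failure, cur holds the current word; loop and retry
        }
    }
    void unlock() { word_.fetch_and(~1u); }
    // Read the payload; hold the lock if you need a consistent read-modify-write.
    std::uint32_t value() const { return word_.load() >> 1; }
    void add(std::uint32_t d) { word_.fetch_add(d << 1); } // payload lives in the upper bits
};

// Single-threaded exercise: two locked additions of 21.
std::uint32_t demo_counter() {
    BitLockedCounter c;
    c.lock(); c.add(21); c.unlock();
    c.lock(); c.add(21); c.unlock();
    return c.value();
}
```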

Dmitry_Vyukov
Valued Contributor I

Nevertheless, perhaps the real solution in the long run isn't atomics but transactional memory; since I'm working on a commercial product with deadlines, though, Intel's STM won't do me any good for now. And adoption of transactional memory, if it catches on, will probably happen several years down the road. So (pseudo-)atomic floats in the meantime would probably be welcomed by many.

Locks will be much, much faster and more scalable. And there is no possibility of deadlock, since the transaction involves only one object.

ARCH_R_Intel
Employee
The September ACM Queue (http://mags.acm.org/queue/200809/) has an excellent overview of the state of Software Transactional Memory. Its subtitle is "why it is only a research toy". The article includes performance comparisons and detailed analysis of 3 TM systems. Its conclusion starts with "Based on our results, we believe that the road ahead for STM is quite challenging". It later remarks [emphasis mine]: "We observed that the TM programming model itself, whether implemented in hardware or software, introduces complexities that limit the expected productivity gains, ...".
jimdempseyatthecove
Honored Contributor III

Robert,

The atomic float operation is not an A-B-A problem. Instead, it is a read/modify/write problem, whereby a competing thread can intercede between the read and the write. This adverse interaction with respect to an operator on a float can be worked around using a CAS (single-word compare and swap). The A-B-A problem typically involves a linked-list operation whereby the head (or tail) pointer points at node A at the read portion of an attempted change; then the thread stalls (interrupt or context switch); then, while the thread is stalled, the list changes state (potentially many times) and ends up with the head pointer pointing at A again. However, and this is the important part: your CAS operation may depend on the head pointer still holding a pointer to A, but your code may also be relying on node A not having been altered between the time you examined A and the time you performed your CAS. If node A had been altered, your CAS would succeed (in this example), but the copied value you obtained from A just before the CAS would not be consistent with the current value. The end result is that you just trashed the head-of-list pointer (or tail pointer). The atomic float can be protected with the CAS; the A-B-A problem cannot.

To protect against A-B-A you typically use a DCAS (double-word compare and swap) on a pointer plus a sequence number. This assumes your processor supports a DCAS instruction.
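[Editor's note] When no DCAS is available, the usual software substitute is to tag the word with a sequence number so that a "returned to A" state is distinguishable. A toy sketch with a 16-bit index plus a 16-bit version packed into one 32-bit word (all names hypothetical; a real lock-free list would tag a pointer, here a small index stands in):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Version-tagged head: the counter makes A -> B -> A observable.
struct TaggedHead {
    std::atomic<std::uint32_t> word{0};   // high 16 bits: version, low 16 bits: index

    static std::uint32_t pack(std::uint16_t ver, std::uint16_t idx) {
        return (std::uint32_t(ver) << 16) | idx;
    }
    // Fails if anyone swapped the head since `expected` was read,
    // even if the index itself has returned to the same value.
    bool try_swap(std::uint32_t expected, std::uint16_t new_idx) {
        std::uint16_t ver = std::uint16_t(expected >> 16);
        return word.compare_exchange_strong(expected, pack(std::uint16_t(ver + 1), new_idx));
    }
};

// Reproduce the A-B-A scenario: index goes 0 -> 5 -> 0, then a CAS based on
// the stale original snapshot must fail because the version has moved on.
bool demo_aba() {
    TaggedHead h;
    std::uint32_t stale = h.word.load();   // version 0, index 0
    h.try_swap(h.word.load(), 5);          // now version 1, index 5
    h.try_swap(h.word.load(), 0);          // index back to 0, but version 2
    return !h.try_swap(stale, 7);          // stale view: CAS correctly fails
}
```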

By the way, you are aware you could have used:

#pragma omp atomic
sum += bump;


Jim Dempsey
