I made an atomic float. It's probably not blazingly fast (that's OK), but it's faster than wrapping a lock around a float, and it works. I'm just not sure whether that's down to good luck or whether it's actually thread safe. I think it is... but you never know, so I came to ask the experts :)
struct AtomicFloat: public tbb::atomic<size_t>
{
float compare_and_swap(float value, float compare)
{
size_t value_ = tbb::atomic<size_t>::compare_and_swap(reinterpret_cast<size_t&>(value), reinterpret_cast<size_t&>(compare));
return reinterpret_cast<float&>(value_);
}
float operator=(float value)
{
size_t value_ = (*this).tbb::atomic<size_t>::operator=(reinterpret_cast<size_t&>(value));
return reinterpret_cast<float&>(value_);
}
operator float()
{
return reinterpret_cast<float&>(*this);
}
float operator+=(float value)
{
volatile float old_value_, new_value_;
do
{
old_value_ = reinterpret_cast<float&>(*this);
new_value_ = old_value_ + value;
} while(compare_and_swap(new_value_, old_value_) != old_value_);
return (new_value_);
}
};
Also, as a caveat, I'm placing a static assert that sizeof(float) == sizeof(size_t).
Thanks!
A mutex can be as small as 1 bit, i.e. you can pack up to 32 mutexes into a single 32-bit word.
You can see an interesting bit-mutex implementation here:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/bbf4e0b8d1db5b4f
The interesting thing about it is that it's possible to lock/unlock an arbitrary set of mutexes with a single atomic RMW.
Or you can pack a mutex and a {pointer, number, enum, etc} into a single word, i.e. no waste of space at all.
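A minimal sketch of the packed bit-mutex idea (names invented here, not taken from the linked post): up to 32 spin mutexes live in one 32-bit word, and lock_set() tries to take an arbitrary set of them with a single atomic RMW, rolling back and retrying on partial success.

```cpp
#include <atomic>
#include <cstdint>

struct BitMutexes {
    std::atomic<std::uint32_t> word{0};

    // Acquire every mutex in `mask`; on partial success, roll back and retry.
    void lock_set(std::uint32_t mask) {
        for (;;) {
            const std::uint32_t prev = word.fetch_or(mask, std::memory_order_acquire);
            if ((prev & mask) == 0)
                return;                       // all requested bits were free
            // Release only the bits this call newly set, then spin again.
            word.fetch_and(~(mask & ~prev), std::memory_order_relaxed);
        }
    }
    void unlock_set(std::uint32_t mask) {
        word.fetch_and(~mask, std::memory_order_release);
    }
};
```

Under heavy contention the rollback-and-retry loop can livelock, so this is a sketch of the mechanism rather than a production lock.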
That is a very interesting solution I hadn't thought about at all. I'm not sure it's the right solution for this exact problem, but it's a very interesting concept that might help in other ways if I can get around to some refactoring there.
Robert,
The atomic float operation is not an A-B-A problem. Instead, it is a read/modify/write problem, whereby a competing thread can intercede between the read and the write. This adverse interaction with respect to an operator on a float can be worked around using a CAS (single-word compare and swap).
The A-B-A problem typically involves a linked-list operation whereby the head (or tail) pointer could point at node A at the read portion of an attempted change; then the thread stalls (interrupt or context switch), and while the thread is stalled, the list changes state (potentially many times) and ends up with the head pointer pointing at A again. However, and this is the important part: your CAS operation may depend on the head pointer still holding a pointer to A, but your code may also be relying on node A not having been altered between the time you examined A and the time you performed your CAS. If node A had been altered, your CAS would succeed (in this example), but the copy you obtained from A just before the CAS would not be consistent with the current value. The end result is that you have just trashed the head-of-list pointer (or tail pointer). The atomic float can be protected with a CAS; the A-B-A problem cannot.
To protect against A-B-A you typically use DCAS (double-word compare and swap) with a pointer and a sequence number. This assumes your processor supports a DCAS instruction.
By the way. You are aware you could have used
#pragma omp atomic
sum += bump;
Jim Dempsey
Thanks for the better A-B-A explanation. I previously ran into the issue with a queue-based state machine (that's when I found out about the term "A-B-A problem"), and became kind of paranoid about it since it couldn't be solved with CAS. (In the end I just placed a lock around it, since it really wasn't worth spending more time on; the actual contention rate was very low.)
As for omp atomic... well, I knew that, and use it often; I just didn't remember it when I began :)
Anyways my goal was to make all the numerical members atomic, and with the same interface, since I have them bound with templates to dozens of scripting engines that have code running on them for which I'm not responsible.
"Anyways my goal was to make all the numerical members atomic, and with the same interface, since I have them bound with templates to dozens of scripting engines that have code running on them for which I'm not responsible." So you aren't going to do a benchmark against tbb::spin_mutex?
"Anyways my goal was to make all the numerical members atomic, and with the same interface, since I have them bound with templates to dozens of scripting engines that have code running on them for which I'm not responsible." So you aren't going to do a benchmark against tbb::spin_mutex?
OK, you're right, I was being lazy here :)
Anyway, the results of my lousy (simple for-loop) benchmark (no one should quote these numbers) were:
spin_mutex 0.0018s
atomic float 0.0036s
omp atomic 0.0036s
critical section 0.0080s
So spin_mutex was the winner, but fatter than atomic float / omp atomic. I'll definitely keep spin_mutex for the moment, but will conserve the atomic solutions until I can do some testing on a more realistic workload, where memory might be more critical than time, or not.
Thanks for the idea there!
The 30x performance figure was with 100 threads + classes and stuff; my previous quick benchmark was with just 4 threads in an omp loop, a more common usage, so more representative of "lab" performance on average(?).
And as you mention, I still have a gut feeling (and some evidence) that atomic floats should easily outperform spin locks (at least in my case, not an HPC simulation), but I can trust spin_lock more than my own atomic float (for now), and it would really suck having to track down some really low-level memory bug. So I'll use spin_lock for the moment.
But now that I have a spin-locked float class, I placed it in my more realistic benchmark again, with 100 threads and a more complex scenario. This time I get:
atomic float: ~0.3 (it became a tad slower with the extra binary check in there)
mutex: ~8.0 (non-recursive fair)
So yes, atomic float is still an order of magnitude (about 25x) faster than the mutex, (15x) faster than the critical section, and (2x) faster than spin_mutex.
So far atomic float's results have been OK, no bugs (noticeable ones at least), but I prefer "guilty until proven innocent" in this case, so until I can really trust atomic float, spinlock is a good replacement for a full mutex.
But if there was a better atomic float you bet I'd use that instead!
P.S. On a side note about performance: the overall performance (not just the tight loops) of the simulator is much better when using atomic and spin_lock than it was with the mutex (my guess is memory locality), so in reality the overall performance boost is more than what I'm showing here.
That does look interesting. Could you get some more resolution on the comparison between atomic float and spin_mutex (run the test a bit longer)?
That does look interesting. Could you get some more resolution on the comparison between atomic float and spin_mutex (run the test a bit longer)?
OK, made a larger benchmark. This time all objects are tightly aligned in a vector, to reduce noise in the results (a more controlled situation). The constants are: full-loops = 100, objects = 1000 (each with 32 atomic/spin floats + some other stuff);
the variable is the amount of work per object.
And the results are
---------------------------------
Low contention scenario:
x1
spin = 0.909276
atomic = 0.940132
x5
spin = 4.8788
atomic = 5.03111
x10
spin = 9.75656
atomic = 10.077
x50
spin = 48.7615
atomic = 50.3652
x100
spin = 97.5665
atomic = 100.776
----------------------------------
Moderate contention scenario: (not much contention)
x1
spin = 0.94363
atomic = 0.994446
x5
spin = 5.04714
atomic = 5.33861
x10
spin = 9.99775
atomic = 10.5859
x50
spin = 50.0049
atomic = 52.8795
x100
spin = 100.084
atomic = 105.835
---------------------------------
High contention scenario:
x1
spin = 1.37543
atomic = 1.48458
x5
spin = 7.53178
atomic = 8.06872
x10
spin = 15.0603
atomic = 16.1485
x50
spin = 75.4305
atomic = 80.8925
x100
spin = 150.738
atomic = 161.545
-----------------------
Object Memory usage:
Spinlocked object (348 bytes)
Atomic object (216 bytes)
Observations:
- spin locks outperform atomics a little, but both seem fine; both grow linearly and with a similar slope.
- dense atomic-float objects are about one third smaller than dense spin-float objects (a good memory gain)
Guess:
- In the larger simulation, much more random and with lots of moving parts, using atomics gains us raw memory, which was probably what affected performance, rather than the locality of the objects.
Everybody's a winner and everybody gets a prize!
Maybe an atomic float can still be useful as a conveniently predefined standard tool (consistently is always nice), not necessarily lock-free...
-dense atomic float objects are about 1 third smaller than dense spin float objects (good memory gain)
If that is a concern and an object contains several floats, you can use a bit-mutex per cache line:
struct my
{
// first cache-line
uint32_t bit_mutex1; // contains 15 mutexes, each mutex occupies 1 bit
float data1 [15];
// second cache-line
uint32_t bit_mutex2; // contains 15 mutexes, each mutex occupies 1 bit
float data2 [15];
// third cache-line
uint32_t bit_mutex3; // contains 7 mutexes, each mutex occupies 1 bit
float data3 [7];
};
Or, if contention is not extreme, you can use a plain spin-mutex per cache line:
struct my
{
// first cache-line
spin_mutex mutex1;
float data1 [15];
// second cache-line
spin_mutex mutex2;
float data2 [15];
// third cache-line
spin_mutex mutex3;
float data3 [7];
};
This is definitely a complication, but it reduces space overhead significantly.
Or, if contention is moderate, you can use just one spin-mutex per whole object:
struct my
{
spin_mutex mutex;
float data [37];
};
Or, if space is really a concern, you can pack the mutex into some pointer/integer/enum field:
struct my
{
my* next; // low-bit is spin-mutex
float data [37];
};
Or...
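The last variant above (lock bit packed into the `next` pointer) can be sketched like this; the low bit of a naturally aligned pointer is always zero, so it is free to serve as a spin mutex. Field and method names here are invented for illustration.

```cpp
#include <atomic>
#include <cstdint>

struct My {
    std::atomic<std::uintptr_t> next_and_lock{0};  // low bit = spin mutex
    float data[37];

    void lock() {
        std::uintptr_t v = next_and_lock.load(std::memory_order_relaxed);
        for (;;) {
            v &= ~std::uintptr_t(1);  // expect the unlocked form
            if (next_and_lock.compare_exchange_weak(v, v | 1,
                                                    std::memory_order_acquire))
                return;               // on failure, v is reloaded; retry
        }
    }
    void unlock() {
        next_and_lock.fetch_and(~std::uintptr_t(1), std::memory_order_release);
    }
    My* next() const {  // readers must mask the lock bit off
        return reinterpret_cast<My*>(
            next_and_lock.load(std::memory_order_acquire) & ~std::uintptr_t(1));
    }
};
```

The masking in next() is the read-side cost Alexey discusses below: every plain use of the pointer must strip the lock bit.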
Be aware of the dangers, however. The main one in this case is of course false sharing. You might pack 512 single-bit mutexes into a 64-byte cache line, but the cache contention will be at least the same as if you just used one mutex residing in that line, if not bigger.
Packing a mutex into the same cache line as the data it protects is generally a good approach, an example of where cache-line sharing between different variables is appropriate.
Packing a mutex as a single bit into some other variable that does not use that bit for another purpose might be appropriate if space is really an issue. An example could be the lowest bit of a pointer, which is usually zero due to natural alignment (unless, of course, you use the pointer to iterate over a single-byte collection such as a C string). But it adds complexity both to the lock and unlock operations (you should preserve the value of the variable, and if it could change outside of the locked scope, you must also account for that possibility) and to reads and writes (masks have to be used).
Be aware of the dangers, however. The main one in this case is of course false sharing. You might pack 512 single-bit mutexes into a 64-byte cache line, but the cache contention will be at least the same as if you just used one mutex residing in that line, if not bigger.
Alexey, Dmitriy, Robert,
An alternate way is to have the float data itself be the mutex. In the IEEE 754 floating-point standard there exist certain bit patterns called SNaN (Signaling Not a Number). These represent invalid operations. You could elect to declare that, should your program ever generate an SNaN, the signal routine (FPU interrupt routine, if one is present) is to take appropriate action or terminate the program, where a non-terminating appropriate action could be to convert the SNaN into a QNaN (Quiet Not a Number). Since your program now, by design, can never legitimately generate an SNaN, you are free to use SNaNs for your own purposes, one or more of which can be to express a mutex lock (others of which you can use to propagate SNaNs if you so wish). An SNaN can have a 0/1 sign bit, all 1's for the exponent (8 bits), a 0 for the msb of the 23-bit mantissa, and anything but 0 for the remaining 22 bits of the mantissa. In binary:
s111 1111 10xx xxxx xxxx xxxx xxxx xxxx
s can be 0/1
the x's can express any 22-bit binary value except 0
0x7F800001 would be an example of an SNaN. The non-zero x's could also contain a 1-based thread number or 22 bits worth of bit mask.
Now, to perform an atomic float reduction (operator +=):
const int32_t LOCK = 0x7F800001;
union
{
float f;
int32_t i;
};
while((i = XCHG(pointer, LOCK)) == LOCK)
_mm_pause();
*pointer = f + r;
(you can add the casts)
No additional memory is consumed for the mutex, and only one lock-class instruction is issued (the InterlockedExchange).
You may or may not wish to follow the write-back with an _mm_sfence(). But this will unnecessarily delay the thread releasing the LOCK (at the expense of a delay to a potential other thread waiting to observe the release). If the floats have very high contention and you wish for some fairness, then consider using the _mm_sfence() (or a second XCHG). Also, depending on processor architecture, the second XCHG may be required: IA32 and EM64T (x64) would not require it; Itanium might (I don't use that one).
Jim Dempsey
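A cleaned-up sketch of the SNaN-as-lock idea above, with std::atomic<uint32_t>::exchange standing in for XCHG (function names invented here): the cell's own 32 bits serve as both the float and the mutex. While the SNaN pattern LOCK is parked in the cell it is owned, and the write-back releases it.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

constexpr std::uint32_t LOCK = 0x7F800001u;  // an SNaN bit pattern

inline std::uint32_t to_bits(float f) {
    std::uint32_t b; std::memcpy(&b, &f, sizeof b); return b;
}
inline float from_bits(std::uint32_t b) {
    float f; std::memcpy(&f, &b, sizeof f); return f;
}

inline float fetch_add_snan(std::atomic<std::uint32_t>& cell, float r) {
    std::uint32_t old;
    // Spin while the old contents are LOCK: someone else holds the cell.
    while ((old = cell.exchange(LOCK, std::memory_order_acquire)) == LOCK)
        ;  // _mm_pause() would go here
    const float result = from_bits(old) + r;
    cell.store(to_bits(result), std::memory_order_release);  // write-back unlocks
    return result;
}
```

memcpy is used instead of the union so the reinterpretation is well defined in standard C++; the generated code is the same register move Jim describes.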
"An alternate way is to have the float data itself be the mutex." This should be smaller and faster than with spin_mutex, but probably only if there are few readers (with a spin_lock the value itself can still be read and written atomically in the original sense of the word), and if a thread holding a lock becomes unavailable the value will be lost, so I would hesitate picking this implementation but it may have its uses.
Quoting - Raf Schietekat
"An alternate way is to have the float data itself be the mutex." This should be smaller and faster than with spin_mutex, but probably only if there are few readers (with a spin_lock the value itself can still be read and written atomically in the original sense of the word), and if a thread holding a lock becomes unavailable the value will be lost, so I would hesitate picking this implementation but it may have its uses.
With spin_lock, the protocol permits reading/using the value (float) provided the operation being performed by the owning thread results in an atomic write with respect to the memory bus (typically the case). Also, the spin_lock protocol prohibits the writing of the value by threads not holding the lock.
With the SNaN technique, the protocol inhibits using the value under change during the ownership of the lock (presumably a very few statements). Also, the SNaN protocol prohibits the writing of the value by threads not holding the lock.
In both cases the protocol can be violated by a non-owning thread writing to the variable (but this is a programming error).
Note, mutex and SNaN can both be avoided by using CAS (InterlockedCompareExchange), provided it works with your programming requirements.
The question then becomes: is it desirable to permit reading of the atomic float while atomic update is pending?
Do not be too quick to answer yes.
Permit me to make a reasonably good argument for SNaN:
One of the main reasons to use an atomic<float> [...]
When using a separate mutex there is an advantage in placing the mutex within the same cache line as the atomic<float> [...]
The placement of mutex and atomic<float> [...]
Now for foibles of unbridled exuberance.
If one can make a single atomic<float> [...]
I can only say that your unbridled exuberance in using new and improved potentially comes at a severe cost (which you must be willing to pay).
Using an array of atomic<float> [...]
If you require the use of hardware vector instructions for performance reasons then you cannot use arrays of a cache-line-joined mutex+atomic<float> [...]
When it is desirable to use hardware vector instructions, you must assure that the floats are properly aligned and compacted into a contiguous array (or sub-sections of the array) of floats that do not move during the duration of the vectored loop. In this case you (may) need a lock on the vector, but you also need to assure that the vector is contiguous. stl::vector<float> [...]
What provides for both individual atomic use and hardware-vector exclusive use is for you to roll your own container, where you have a contiguous array with a lock on the array, optionally locks on zones of the array, and potentially individual cell locks using the SNaN technique.
For individual atomic<float> [...]
Also not discussed yet: if coded properly, the use of SNaN permits the programmer to code read access to a lockable variable without an explicit test of whether the variable is locked. You let the hardware perform the lock test by generating a floating-point exception, whereupon your exception handler notes the designated SNaN and retries. If the expected collision frequency is high, then either use CAS for the reduction, or peek and test for SNaN, or use a separate mutex. Again, the application requirements determine how best to protect the float.
The biggest problem for any of these atomic<float> [...]
In various other forum messages I suggested to Intel that they seriously consider adding two instructions to the instruction set:
Temporarily Inhibit Interrupts (TII)
Cancel Temporary Inhibit Interrupt (CTII)
The TII could be followed by an immediate value indicating the number of instruction-fetch cycles to delay acceptance of interrupts from outside sources (such as the clock), but not events such as divide by 0, page fault, etc. To avoid abusive programming practices, TII could implicitly perform a CTII (should a TII be pending) and honor pending interrupts prior to entering the TII state.
The TII state then runs for the specified number of instruction fetches, or until fault, or until CTII occurs.
What this buys you is that, excepting for faults, your program can guarantee short-duration locks. And with the guarantee of short-duration locks, you can avoid all this unnecessary coding effort to work around the occasional circumstance of the lock-owning thread being interrupted and scheduled out. This would mean you could dispense with lock-free/wait-free programming (at least for the simple cases), dispense with calling SwitchToThread in your busy-wait loop, and possibly dispense with _mm_pause. Programming becomes much simpler and more efficient.
Jim Dempsey
"Also the spin_lock protocol prohibits the writing of the value by the threads not holding the lock." Yes, I meant (only) "read" of course, not "read and written", sorry (again).
Maybe I just have not seen enough code, but for the word count in thread "Scalable hash map algorithm" http://software.intel.com/en-us/forums/showthread.php?t=60494 it was easy to find a solution that did not involve sharing data. It's only a hunch, though, that code that almost never reads can probably be optimised similarly.
Plural of "mutex" would never be "mutii" even if it were a Latin word (is it a "portmanteau word" for "mutual exclusion" or would it be called something else?); "mutices" would be more likely. But it seems fruitless to try to use Latin inflexions in English: the plural of "fructus" is still "fructus" when written. Some words only look like Latin: the plural of "octopus" would more likely be "octopodoi" (but then why is this not just a single animal with eight "feet"?), and the word itself is transcribed from something looking more like "oktopous".
You're making unfounded assumptions about my "exuberance", but the compatibility with vector instructions seems a potentially relevant consideration. Use of signalling looks quite clever, although it probably should not be used in a basic atomic<float> [...]
The idea behind "Temporarily Inhibit Interrupts" has been implemented before as mentioned elsewhere on this forum, and seems very useful indeed.
[...]
Plural of "mutex" would never be "mutii" even if it were a Latin word (is it a "portmanteau word" for "mutual exclusion" or would it be called something else?); "mutices" would be more likely. But it seems fruitless to try to use Latin inflexions in English: the plural of "fructus" is still "fructus" when written. Some words only look like Latin: the plural of "octopus" would more likely be "octopodoi" (but then why is this not just a single animal with eight "feet"?), and the word itself is transcribed from something looking more like "oktopous".
[...]
To avoid confusion: the plural forms are just "mutexes" (if you don't want to use "mutex variables") and "octopuses" (I wish I had not used this example, though, because "octopi" is also accepted). Now I hope this will be the end of that.
P.S.: Thanks to Robert for those lottery numbers, I have now become independently wealthy... not!
I thought I might have a go at adding atomic floats, but I got stuck on the exact behaviour.
First I decided that plus zero and minus zero are equal for operator==, and must therefore be equal for compare_and_swap, and therefore also for compare_and_store (and friends) because of functional equivalence when spurious failure is disallowed. For operations where spurious failure is not disallowed, it would be undefined whether this constitutes spurious failure (any use should be prepared to retry with the updated comparand value, but there would be no guarantee that the retry would not be internalised for not being a contention-related occurrence). So far so good?
The problem with a NaN value is that it is not equal to anything, even itself (right?). So it would never be possible to replace a NaN value (using the reasoning above for plus/minus zero), and there would be numerous opportunities for infinite loops. One solution would be to not define behaviour with NaN values, but this seems rather problematic and would not win atomic<float> [...]
Any solution should probably be compatible with as yet undefined atomics on general object types, which would probably be required to have an equality operator. Maybe the unifying concept could be to throw an exception whenever a value is detected that is not equal to itself?
Any thoughts?
[...]
"Now I hope this will be the end of that." And of course I could be expected to violate that myself: "octopus", while having its origin in Greek, is indeed a Latin word, but with plural form "octopodes" (third declension, stem "octopod"), so "octopi" is just plain wrong and I was right after all. :-)
Raf
The compare_and_swap, when you drill down to the instruction set of the computer (IA32, IA64, EM64T/x64), does NOT perform a floating-point comparison.
As I attempted to explain earlier, the CMPXCHG instruction uses the binary pattern of the word and not an equivalent float. Therefore, the old value you obtain, for later use with CMPXCHG, must be obtained using memory-to-register/register-to-memory move instructions and not by way of a floating-point load and floating-point store, which may modify the bit pattern in the process or throw an exception in the case of an SNaN. Once the old value is obtained, you are free to use it (the copy) as a float in an expression, or simply ignore it until it comes time to be used for the CMPXCHG. In this manner even a NaN can be used for the CMPXCHG while not being used in an expression (e.g. when you initialize an atomic float).
Jim Dempsey
Yes, the instructions work on the "value representations" (bit patterns), but I was looking for a way to make things work nicely for the user with the equality test that would be performed on the result of compare_and_swap (avoiding "oops" moments where the test says yes and the swap didn't occur anyway). NaNs in uninitialised data would not be a concern because simple assignment gets rid of them easily; the concern is that naive or generic compare_and_... loops might get stuck forever, but the assumption would still be that such loops are not used with uninitialised or unassigned data: there's a difference between forgiving and totally foolproof.
Are you saying that operations on float/double variables are not restricted to their domain by default? I thought that only intermediate values in a calculation are kept in longer registers, and that these are round-trip faithful anyway. Equality tests would have to be between variables, of course, not between a variable and a calculated result.
Anyway, I'm not set on getting somewhere with this, just considering the possibilities for the moment.