I made an atomic float. It's probably not blazingly fast (that's OK), but it's faster than wrapping a lock around a float, and it works. I'm just not sure whether that's down to good luck or whether it's actually thread safe. I think it is... but you never know, so I came to ask the experts :)
struct AtomicFloat: public tbb::atomic<size_t>
{
float compare_and_swap(float value, float compare)
{
size_t value_ = tbb::atomic<size_t>::compare_and_swap(reinterpret_cast<size_t&>(value), reinterpret_cast<size_t&>(compare));
return reinterpret_cast<float&>(value_);
}
float operator=(float value)
{
size_t value_ = (*this).tbb::atomic<size_t>::operator=(reinterpret_cast<size_t&>(value));
return reinterpret_cast<float&>(value_);
}
operator float()
{
return reinterpret_cast<float&>(*this);
}
float operator+=(float value)
{
volatile float old_value_, new_value_;
do
{
old_value_ = reinterpret_cast<float&>(*this);
new_value_ = old_value_ + value;
} while(compare_and_swap(new_value_, old_value_) != old_value_);
return (new_value_);
}
};
Also, as a caveat, I'm placing a static assert that sizeof(float) == sizeof(size_t).
Thanks!
A mutex can be as small as 1 bit, i.e. you can pack up to 32 mutexes into a single 32-bit word.
You can see an interesting bit-mutex implementation here:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/bbf4e0b8d1db5b4f
The interesting thing about it is that it's possible to lock/unlock an arbitrary set of mutexes with a single atomic RMW.
Or you can pack a mutex and a {pointer, number, enum, etc} into a single word, i.e. no waste of space at all.
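A minimal sketch of the packed bit-mutex idea (names invented here, not taken from the linked post): up to 32 spin mutexes live in one 32-bit word, and lock_set() tries to take an arbitrary set of them with a single atomic RMW, rolling back and retrying on partial success.

```cpp
#include <atomic>
#include <cstdint>

struct BitMutexes {
    std::atomic<std::uint32_t> word{0};

    // Acquire every mutex in `mask`; on partial success, roll back and retry.
    void lock_set(std::uint32_t mask) {
        for (;;) {
            const std::uint32_t prev = word.fetch_or(mask, std::memory_order_acquire);
            if ((prev & mask) == 0)
                return;                       // all requested bits were free
            // Release only the bits this call newly set, then spin again.
            word.fetch_and(~(mask & ~prev), std::memory_order_relaxed);
        }
    }
    void unlock_set(std::uint32_t mask) {
        word.fetch_and(~mask, std::memory_order_release);
    }
};
```

Under heavy contention the rollback-and-retry loop can livelock, so this is a sketch of the mechanism rather than a production lock.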
That is a very interesting solution I hadn't thought about at all. I'm not sure it's the right solution for this exact problem, but it's a very interesting concept that might help in other ways if I can get around to some refactoring there.
Robert,
The atomic float operation is not an A-B-A problem. Instead, it is a read/modify/write problem, whereby a competing thread can intercede between the read and the write. This adverse interaction with respect to an operator on a float can be worked around using a CAS (single-word compare and swap).
The A-B-A problem typically involves a linked-list operation whereby the head (or tail) pointer could point at node A at the read portion of an attempted change; then the thread stalls (interrupt or context switch), and while the thread is stalled, the list changes state (potentially many times) and ends up with the head pointer pointing at A again. However, and this is the important part: your CAS operation may depend on the head pointer still holding a pointer to A, but your code may also be relying on node A not having been altered between the time you examined A and the time you performed your CAS. If node A had been altered, your CAS would succeed (in this example), but the copy you obtained from A just before the CAS would not be consistent with the current value. The end result is that you have just trashed the head-of-list pointer (or tail pointer). The atomic float can be protected with a CAS; the A-B-A problem cannot.
To protect against A-B-A you typically use DCAS (double-word compare and swap) with a pointer and a sequence number. This assumes your processor supports a DCAS instruction.
By the way. You are aware you could have used
#pragma omp atomic
sum += bump;
Jim Dempsey
Thanks for the better A-B-A explanation. I previously ran into the issue with a queue-based state machine (that's when I found out about the term "A-B-A problem"), and became kind of paranoid about it since it couldn't be solved with CAS. (In the end I just placed a lock around it, since it really wasn't worth spending more time on; the actual contention rate was very low.)
As for omp atomic... well, I knew that, and use it often; I just didn't remember it when I began :)
Anyways my goal was to make all the numerical members atomic, and with the same interface, since I have them bound with templates to dozens of scripting engines that have code running on them for which I'm not responsible.
"Anyways my goal was to make all the numerical members atomic, and with the same interface, since I have them bound with templates to dozens of scripting engines that have code running on them for which I'm not responsible." So you aren't going to do a benchmark against tbb::spin_mutex?
"Anyways my goal was to make all the numerical members atomic, and with the same interface, since I have them bound with templates to dozens of scripting engines that have code running on them for which I'm not responsible." So you aren't going to do a benchmark against tbb::spin_mutex?
OK, you're right, I was being lazy here :)
Anyway, the results of my lousy (simple for-loop) benchmark (no one should quote these numbers) were:
spin_mutex 0.0018s
atomic float 0.0036s
omp atomic 0.0036s
critical section 0.0080s
So spin_mutex was the winner, but fatter than atomic float / omp atomic. I'll definitely keep spin_mutex for the moment, but will conserve the atomic solutions until I can do some testing on a more realistic workload, where memory might be more critical than time, or not.
Thanks for the idea there!
The 30x performance figure was with 100 threads + classes and stuff; my previous quick benchmark was with just 4 threads in an omp loop, a more common usage, so more representative of "lab" performance on average(?).
And as you mention, I still have a gut feeling (and some evidence) that atomic floats should easily outperform spin locks (at least in my case, not an HPC simulation), but I can trust spin_lock more than my own atomic float (for now), and it would really suck having to track down some really low-level memory bug. So I'll use spin_lock for the moment.
But now that I have a spin-locked float class, I placed it in my more realistic benchmark again, with 100 threads and a more complex scenario. This time I get:
atomic float: ~0.3 (it became a tad slower with the extra binary check in there)
mutex: ~8.0 (non-recursive fair)
So yes, atomic float is still an order of magnitude (about 25x) faster than the mutex, (15x) faster than the critical section, and (2x) faster than spin_mutex.
So far atomic float's results have been OK, no bugs (noticeable ones at least), but I prefer "guilty until proven innocent" in this case, so until I can really trust atomic float, spinlock is a good replacement for a full mutex.
But if there was a better atomic float you bet I'd use that instead!
P.S. On a side note about performance: the overall performance (not just the tight loops) of the simulator is much better when using atomic and spin_lock than it was with the mutex (my guess is memory locality), so in reality the overall performance boost is more than what I'm showing here.
That does look interesting. Could you get some more resolution on the comparison between atomic float and spin_mutex (run the test a bit longer)?
That does look interesting. Could you get some more resolution on the comparison between atomic float and spin_mutex (run the test a bit longer)?
OK, made a larger benchmark. This time all objects are tightly aligned in a vector, to reduce noise in the results (a more controlled situation). The constants are: full-loops = 100, objects = 1000 (each with 32 atomic/spin floats + some other stuff);
the variable is the amount of work per object.
And the results are
---------------------------------
Low contention scenario:
x1
spin = 0.909276
atomic = 0.940132
x5
spin = 4.8788
atomic = 5.03111
x10
spin = 9.75656
atomic = 10.077
x50
spin = 48.7615
atomic = 50.3652
x100
spin = 97.5665
atomic = 100.776
----------------------------------
Moderate contention scenario: (not much contention)
x1
spin = 0.94363
atomic = 0.994446
x5
spin = 5.04714
atomic = 5.33861
x10
spin = 9.99775
atomic = 10.5859
x50
spin = 50.0049
atomic = 52.8795
x100
spin = 100.084
atomic = 105.835
---------------------------------
High contention scenario:
x1
spin = 1.37543
atomic = 1.48458
x5
spin = 7.53178
atomic = 8.06872
x10
spin = 15.0603
atomic = 16.1485
x50
spin = 75.4305
atomic = 80.8925
x100
spin = 150.738
atomic = 161.545
-----------------------
Object Memory usage:
Spinlocked object (348 bytes)
Atomic object (216 bytes)
Observations:
- spin locks outperform atomics a little, but both seem fine; both grow linearly and with a similar slope.
- dense atomic-float objects are about one third smaller than dense spin-float objects (a good memory gain)
Guess:
- In the larger simulation, much more random and with lots of moving parts, using atomics gains us raw memory, which was probably what affected performance, rather than the locality of the objects.
Everybody's a winner and everybody gets a prize!
Maybe an atomic float can still be useful as a conveniently predefined standard tool (consistently is always nice), not necessarily lock-free...
-dense atomic float objects are about 1 third smaller than dense spin float objects (good memory gain)
If that is a concern and an object contains several floats, you can use a bit-mutex per cache line:
struct my
{
// first cache-line
uint32_t bit_mutex1; // contains 15 mutexes, each mutex occupies 1 bit
float data1 [15];
// second cache-line
uint32_t bit_mutex2; // contains 15 mutexes, each mutex occupies 1 bit
float data2 [15];
// third cache-line
uint32_t bit_mutex3; // contains 7 mutexes, each mutex occupies 1 bit
float data3 [7];
};
Or, if contention is not extreme, you can use a plain spin-mutex per cache line:
struct my
{
// first cache-line
spin_mutex mutex1;
float data1 [15];
// second cache-line
spin_mutex mutex2;
float data2 [15];
// third cache-line
spin_mutex mutex3;
float data3 [7];
};
This is definitely a complication, but it reduces space overhead significantly.
Or, if contention is moderate, you can use just one spin-mutex per whole object:
struct my
{
spin_mutex mutex;
float data [37];
};
Or, if space is really a concern, you can pack the mutex into some pointer/integer/enum field:
struct my
{
my* next; // low-bit is spin-mutex
float data [37];
};
Or...
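The last variant above (lock bit packed into the `next` pointer) can be sketched like this; the low bit of a naturally aligned pointer is always zero, so it is free to serve as a spin mutex. Field and method names here are invented for illustration.

```cpp
#include <atomic>
#include <cstdint>

struct My {
    std::atomic<std::uintptr_t> next_and_lock{0};  // low bit = spin mutex
    float data[37];

    void lock() {
        std::uintptr_t v = next_and_lock.load(std::memory_order_relaxed);
        for (;;) {
            v &= ~std::uintptr_t(1);  // expect the unlocked form
            if (next_and_lock.compare_exchange_weak(v, v | 1,
                                                    std::memory_order_acquire))
                return;               // on failure, v is reloaded; retry
        }
    }
    void unlock() {
        next_and_lock.fetch_and(~std::uintptr_t(1), std::memory_order_release);
    }
    My* next() const {  // readers must mask the lock bit off
        return reinterpret_cast<My*>(
            next_and_lock.load(std::memory_order_acquire) & ~std::uintptr_t(1));
    }
};
```

The masking in next() is the read-side cost Alexey discusses below: every plain use of the pointer must strip the lock bit.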
Be aware of the dangers, however. The main one in this case is of course false sharing. You might pack 512 single-bit mutexes into a 64-byte cache line, but the cache contention will be at least the same as if you just used one mutex residing in that line, if not bigger.
Packing a mutex into the same cache line as the data it protects is generally a good approach, an example of where cache-line sharing between different variables is appropriate.
Packing a mutex as a single bit into some other variable that does not use that bit for another purpose might be appropriate if space is really an issue. An example could be the lowest bit of a pointer, which is usually zero due to natural alignment (unless, of course, you use the pointer to iterate over a single-byte collection such as a C string). But it adds complexity both to the lock and unlock operations (you should preserve the value of the variable, and if it could change outside of the locked scope, you must also account for that possibility) and to reads and writes (masks have to be used).
Be aware of the dangers, however. The main one in this case is of course false sharing. You might pack 512 single-bit mutexes into a 64-byte cache line, but the cache contention will be at least the same as if you just used one mutex residing in that line, if not bigger.
Alexey, Dmitriy, Robert,
An alternate way is to have the float data itself be the mutex. In the IEEE 754 floating-point standard there exist certain bit patterns called SNaN (Signaling Not a Number). These represent invalid operations. You could elect to declare that, should your program ever generate an SNaN, the signal routine (FPU interrupt routine, if one is present) is to take appropriate action or terminate the program, where a non-terminating appropriate action could be to convert the SNaN into a QNaN (Quiet Not a Number). Since your program now, by design, can never legitimately generate an SNaN, you are free to use SNaNs for your own purposes, one or more of which can be to express a mutex lock (others of which you can use to propagate SNaNs if you so wish). An SNaN can have a 0/1 sign bit, all 1's for the exponent (8 bits), a 0 for the msb of the 23-bit mantissa, and anything but 0 for the remaining 22 bits of the mantissa. In binary:
s111 1111 10xx xxxx xxxx xxxx xxxx xxxx
s can be 0/1
the x's can express any 22-bit binary value except 0
0x7F800001 would be an example of an SNaN. The non-zero x's could also contain a 1-based thread number or 22 bits worth of bit mask.
Now, to perform an atomic float reduction (operator +=):
const int32_t LOCK = 0x7F800001;
union
{
float f;
int32_t i;
};
while((i = XCHG(pointer, LOCK)) == LOCK)
_mm_pause();
*pointer = f + r;
(you can add the casts)
No additional memory is consumed for the mutex, and only one lock-class instruction is issued (the InterlockedExchange).
You may or may not wish to follow the write-back with an _mm_sfence(). But this will unnecessarily delay the thread releasing the LOCK (at the expense of a delay to a potential other thread waiting to observe the release). If the floats have very high contention and you wish for some fairness, then consider using the _mm_sfence() (or a second XCHG). Also, depending on processor architecture, the second XCHG may be required: IA32 and EM64T (x64) would not require it; Itanium might (I don't use that one).
Jim Dempsey
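A cleaned-up sketch of the SNaN-as-lock idea above, with std::atomic<uint32_t>::exchange standing in for XCHG (function names invented here): the cell's own 32 bits serve as both the float and the mutex. While the SNaN pattern LOCK is parked in the cell it is owned, and the write-back releases it.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

constexpr std::uint32_t LOCK = 0x7F800001u;  // an SNaN bit pattern

inline std::uint32_t to_bits(float f) {
    std::uint32_t b; std::memcpy(&b, &f, sizeof b); return b;
}
inline float from_bits(std::uint32_t b) {
    float f; std::memcpy(&f, &b, sizeof f); return f;
}

inline float fetch_add_snan(std::atomic<std::uint32_t>& cell, float r) {
    std::uint32_t old;
    // Spin while the old contents are LOCK: someone else holds the cell.
    while ((old = cell.exchange(LOCK, std::memory_order_acquire)) == LOCK)
        ;  // _mm_pause() would go here
    const float result = from_bits(old) + r;
    cell.store(to_bits(result), std::memory_order_release);  // write-back unlocks
    return result;
}
```

memcpy is used instead of the union so the reinterpretation is well defined in standard C++; the generated code is the same register move Jim describes.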
"An alternate way is to have the float data itself be the mutex." This should be smaller and faster than with spin_mutex, but probably only if there are few readers (with a spin_lock the value itself can still be read and written atomically in the original sense of the word), and if a thread holding a lock becomes unavailable the value will be lost, so I would hesitate picking this implementation but it may have its uses.
Quoting - Raf Schietekat
"An alternate way is to have the float data itself be the mutex." This should be smaller and faster than with spin_mutex, but probably only if there are few readers (with a spin_lock the value itself can still be read and written atomically in the original sense of the word), and if a thread holding a lock becomes unavailable the value will be lost, so I would hesitate picking this implementation but it may have its uses.
With spin_lock, the protocol permits reading/using the value (float) provided the operation being performed by the owning thread results in an atomic write with respect to the memory bus (typically the case). Also, the spin_lock protocol prohibits the writing of the value by threads not holding the lock.
With the SNaN technique, the protocol inhibits using the value under change during the ownership of the lock (presumably a very few statements). Also, the SNaN protocol prohibits the writing of the value by threads not holding the lock.
In both cases the protocol can be violated by a non-owning thread writing to the variable (but this is a programming error).
Note, mutex and SNaN can both be avoided by using CAS (InterlockedCompareExchange), provided it works with your programming requirements.
The question then becomes: is it desirable to permit reading of the atomic float while atomic update is pending?
Do not be too quick to answer yes.
Permit me to make a reasonably good argument for SNaN:
One of the main reasons to use an atomic<float> [...]
When using a separate mutex there is an advantage in placing the mutex within the same cache line as the atomic<float> [...]
The placement of mutex and atomic<float> [...]
Now for foibles of unbridled exuberance.
If one can make a single atomic<float> [...]
I can only say that your unbridled exuberance in using new and improved potentially comes at a severe cost (which you must be willing to pay).
Using an array of atomic<float> [...]
If you require the use of hardware vector instructions for performance reasons then you cannot use arrays of a cache-line-joined mutex+atomic<float> [...]
When it is desirable to use hardware vector instructions, you must assure that the floats are properly aligned and compacted into a contiguous array (or sub-sections of the array) of floats that do not move during the duration of the vectored loop. In this case you (may) need a lock on the vector, but you also need to assure that the vector is contiguous. stl::vector<float> [...]
What provides for both individual atomic use and hardware-vector exclusive use is for you to roll your own container, where you have a contiguous array with a lock on the array, optionally locks on zones of the array, and potentially individual cell locks using the SNaN technique.
For individual atomic<float> [...]
Also not discussed yet: if coded properly, the use of SNaN permits the programmer to code read access to a lockable variable without an explicit test of whether the variable is locked. You let the hardware perform the lock test by generating a floating-point exception, whereupon your exception handler notes the designated SNaN and retries. If the expected collision frequency is high, then either use CAS for the reduction, or peek and test for SNaN, or use a separate mutex. Again, the application requirements determine how best to protect the float.
The biggest problem for any of these atomic<float> [...]
In various other forum messages I suggested to Intel that they seriously consider adding two instructions to the instruction set:
Temporarily Inhibit Interrupts (TII)
Cancel Temporary Inhibit Interrupt (CTII)
The TII could be followed by an immediate value indicating the number of instruction-fetch cycles to delay acceptance of interrupts from outside sources (such as the clock), but not events such as divide by 0, page fault, etc. To avoid abusive programming practices, TII could implicitly perform a CTII (should a TII be pending) and honor pending interrupts prior to entering the TII state.
The TII state then runs for the specified number of instruction fetches, or until fault, or until CTII occurs.
What this buys you is that, excepting for faults, your program can guarantee short-duration locks. And with the guarantee of short-duration locks, you can avoid all this unnecessary coding effort to work around the occasional circumstance of the lock-owning thread being interrupted and scheduled out. This would mean you could dispense with lock-free/wait-free programming (at least for the simple cases), dispense with calling SwitchToThread in your busy-wait loop, and possibly dispense with _mm_pause. Programming becomes much simpler and more efficient.
Jim Dempsey
"Also the spin_lock protocol prohibits the writing of the value by the threads not holding the lock." Yes, I meant (only) "read" of course, not "read and written", sorry (again).
Maybe I just have not seen enough code, but for the word count in thread "Scalable hash map algorithm" http://software.intel.com/en-us/forums/showthread.php?t=60494 it was easy to find a solution that did not involve sharing data. It's only a hunch, though, that code that almost never reads can probably be optimised similarly.
Plural of "mutex" would never be "mutii" even if it were a Latin word (is it a "portmanteau word" for "mutual exclusion" or would it be called something else?); "mutices" would be more likely. But it seems fruitless to try to use Latin inflexions in English: the plural of "fructus" is still "fructus" when written. Some words only look like Latin: the plural of "octopus" would more likely be "octopodoi" (but then why is this not just a single animal with eight "feet"?), and the word itself is transcribed from something looking more like "oktopous".
You're making unfounded assumptions about my "exuberance", but the compatibility with vector instructions seems a potentially relevant consideration. Use of signalling looks quite clever, although it probably should not be used in a basic atomic<float> [...]
The idea behind "Temporarily Inhibit Interrupts" has been implemented before as mentioned elsewhere on this forum, and seems very useful indeed.
[...]
Plural of "mutex" would never be "mutii" even if it were a Latin word (is it a "portmanteau word" for "mutual exclusion" or would it be called something else?); "mutices" would be more likely. But it seems fruitless to try to use Latin inflexions in English: the plural of "fructus" is still "fructus" when written. Some words only look like Latin: the plural of "octopus" would more likely be "octopodoi" (but then why is this not just a single animal with eight "feet"?), and the word itself is transcribed from something looking more like "oktopous".
[...]
To avoid confusion: the plural forms are just "mutexes" (if you don't want to use "mutex variables") and "octopuses" (I wish I had not used this example, though, because "octopi" is also accepted). Now I hope this will be the end of that.
P.S.: Thanks to Robert for those lottery numbers, I have now become independently wealthy... not!
I thought I might have a go at adding atomic floats, but I got stuck on the exact behaviour.
First I decided that plus zero and minus zero are equal for operator==, and must therefore be equal for compare_and_swap, and therefore also for compare_and_store (and friends) because of functional equivalence when spurious failure is disallowed. For operations where spurious failure is not disallowed, it would be undefined whether this constitutes spurious failure (any use should be prepared to retry with the updated comparand value, but there would be no guarantee that the retry would not be internalised for not being a contention-related occurrence). So far so good?
The problem with a NaN value is that it is not equal to anything, even itself (right?). So it would never be possible to replace a NaN value (using the reasoning above for plus/minus zero), and there would be numerous opportunities for infinite loops. One solution would be to not define behaviour with NaN values, but this seems rather problematic and would not win atomic<float> [...]
Any solution should probably be compatible with as yet undefined atomics on general object types, which would probably be required to have an equality operator. Maybe the unifying concept could be to throw an exception whenever a value is detected that is not equal to itself?
Any thoughts?
[...]
"Now I hope this will be the end of that." And of course I could be expected to violate that myself: "octopus", while having its origin in Greek, is indeed a Latin word, but with plural form "octopodes" (third declension, stem "octopod"), so "octopi" is just plain wrong and I was right after all. :-)
Raf
The compare_and_swap, when you drill down to the instruction set of the computer (IA32, IA64, EM64T/x64), does NOT perform a floating-point comparison.
As I attempted to explain earlier, the CMPXCHG instruction uses the binary pattern of the word and not an equivalent float. Therefore, the old value you obtain, for later use with CMPXCHG, must be obtained using memory-to-register/register-to-memory move instructions and not by way of a floating-point load and floating-point store, which may modify the bit pattern in the process or throw an exception in the case of an SNaN. Once the old value is obtained, you are free to use it (the copy) as a float in an expression, or simply ignore it until it comes time to be used for the CMPXCHG. In this manner even a NaN can be used for the CMPXCHG while not being used in an expression (e.g. when you initialize an atomic float).
Jim Dempsey
Yes, the instructions work on the "value representations" (bit patterns), but I was looking for a way to make things work nicely for the user with the equality test that would be performed on the result of compare_and_swap (avoiding "oops" moments where the test says yes and the swap didn't occur anyway). NaNs in uninitialised data would not be a concern because simple assignment gets rid of them easily; the concern is that naive or generic compare_and_... loops might get stuck forever, but the assumption would still be that such loops are not used with uninitialised or unassigned data: there's a difference between forgiving and totally foolproof.
Are you saying that operations on float/double variables are not restricted to their domain by default? I thought that only intermediate values in a calculation are kept in longer registers, and that these are round-trip faithful anyway. Equality tests would have to be between variables, of course, not between a variable and a calculated result.
Anyway, I'm not set on getting somewhere with this, just considering the possibilities for the moment.