Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Why is TBB missing atomic floats?

Philipp1
Beginner
I used TBB to parallelize an application where multiple threads update arbitrary elements of a huge field of double-precision floats. So far, I have used locks to prevent race conditions.
In a similar implementation based on OpenMP, I used #pragma omp atomic to avoid race conditions. Although atomic floats are very slow, this paid off in much better scalability and eventually in better performance.
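For illustration, this is roughly the kind of update I mean (heavily simplified, identifiers invented):

// Sketch of the OpenMP variant (not my actual code): multiple threads add
// contributions to arbitrary elements of a shared field of doubles.
void scatter_add(double* field, const int* target, const double* value, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        field[target[i]] += value[i];   // atomic update of a double, no explicit lock
    }
}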

As far as I understand from the TBB documentation and this forum, there is no support for atomic floats in TBB because there is no need for it. Anyway, is this the only reason? Or do hardware limitations hamper the implementation of atomic floats in TBB?
11 Replies
RafSchietekat
Valued Contributor III

Perhaps you might look in "Additions to atomic" (floating-point support added in #138, latest version in #144), and provide some feedback?

Philipp1
Beginner
Quoting - Raf Schietekat

Perhaps you might look in "Additions to atomic" (floating-point support added in #138, latest version in #144), and provide some feedback?


Thank you for your prompt reply.

I recompiled TBB using your patch and it works just fine. More importantly, in my application, atomic operations in TBB perform on par with atomic operations in OpenMP. There is only one minor problem: I have to stick to the atomic data type, that is, all modifications are atomic and I also have to overload methods that should deal with this data type.

Anyway, when will this patch become part of any TBB release? Is there still no need for atomic floats?

Again, thank you for your help.

RafSchietekat
Valued Contributor III
"I recompiled TBB using your path and it works just fine. More important, in my application, atomic operations in TBB perform equally to atomic operations in OpenMP."
Thanks, glad to hear that. Be sure to really test it, though.

"I have to stick to the atomic data type, that is, all modifications are atomic and I also have to overload methods that should deal with this data type."
What exactly do you mean by that?

"Anyway, when will this path become part of any TBB release? Is there still no need for atomic floats?"
Intel's compiler has some C++0x features already (I'd have to check about atomics), so maybe that's why an integration of this into mainline TBB is perceived as less urgent; I have no idea when C++0x features can be expected to be widely available. It's up to potential users like yourself to express any need for atomic floats.
Bartlomiej
New Contributor I
Wow! I have to admit, I was convinced atomic floating-point operations would require some support from the CPU that is missing. It's great news that I was wrong! ;-)
As a numerical-computations-guy, I can see several applications for this feature.

And I hope the patch is compatible with TBB 2.2.

Philipp1
Beginner
Quoting - Raf Schietekat
"I recompiled TBB using your path and it works just fine. More important, in my application, atomic operations in TBB perform equally to atomic operations in OpenMP."
Thanks, glad to hear that. Be sure to really test it, though.

I compared the output (8-core CPU) to that of a sequential implementation that uses no threading library at all. The output is identical. The output always differed when race conditions were not prevented.

Quoting - Raf Schietekat
"I have to stick to the atomic data type, that is, all modifications are atomic and I also have to overload methods that should deal with this data type."
What exactly do you mean by that?

In my application I have a method with the following signature: void convolve(double*). Within this method I use parallel_for to perform a convolution on the argument field. No race conditions occur in this method.
Beforehand, the same field is updated by multiple threads concurrently. That is why I changed the field's type to atomic<double>*. Consequently, I also had to change my method to convolve(atomic<double>*), though atomicity is not necessary within convolve. I have not tested it yet, but I suppose this will also considerably decrease this method's performance.
Maybe this is an advantage of OpenMP: atomicity is only specified when necessary.
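In code, the situation is roughly this (heavily simplified, identifiers invented; assuming the patched tbb::atomic<double> from "Additions to atomic"):

#include "tbb/atomic.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

// Sketch only: because the scatter phase needs atomic elements, the whole field
// carries the atomic type into convolve(), although convolve() itself is race-free
// (it reads the atomic input and writes a plain output array).
struct ConvolveBody {
    const tbb::atomic<double>* in;   // forced to be atomic by the other phase
    double* out;
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i)
            out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];  // 3-point stencil
    }
};

void convolve(tbb::atomic<double>* field, double* result, int n)  // was: void convolve(double*)
{
    ConvolveBody body = { field, result };
    tbb::parallel_for(tbb::blocked_range<int>(1, n - 1), body);
}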

Quoting - Raf Schietekat
"Anyway, when will this path become part of any TBB release? Is there still no need for atomic floats?"
Intel's compiler has some C++0x features already (I'd have to check about atomics), so maybe that's why an integration of this into mainline TBB is perceived as less urgent; I have no idea when C++0x features can be expected to be widely available. It's up to potential users like yourself to express any need for atomic floats.

Please, excuse my ignorance, but how are atomics related to C++0x?

Btw. I use the Intel C/C++ compiler version 10.1.

RafSchietekat
Valued Contributor III
#4 "And I hope the patch is compatible with TBB 2.2."
Care to sponsor an update? :-)

#5 "Please, excuse my ignorance, but how are atomics related to C++0x?"
There's a new "atomic operations library".
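For instance, a sketch of what the draft's interface looks like (availability depends on the compiler; floating-point addition has to be written as a compare-exchange loop because there is no fetch_add for non-integral types):

#include <atomic>

// C++0x-style atomic double (sketch): atomic load/store/exchange/CAS are provided,
// but addition must be built from compare_exchange.
void atomic_add(std::atomic<double>& target, double value)
{
    double expected = target.load();
    // On failure, compare_exchange_weak refreshes 'expected' with the current value.
    while (!target.compare_exchange_weak(expected, expected + value))
        ;   // retry until no other thread intervened
}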

#5 "Btw. I use the Intel C/C++ compiler version 10.1."
I defer to Intel for details.
robert_jay_gould
Beginner
Oh yeah Raf is an angel :)

I asked the same question about a year ago and posted some crazy atomic prototypes of my own. Anyway, the addition of atomic floats is really nice; it's a really useful feature.

Thanks to Raf for actually implementing them correctly (I hope!)

However, one curious bit of information that came up at the time is that protecting your floats with a user-space spin-lock results in practically the same performance. Using a spin-lock will probably make everything safer (if you are running on some edge-case platforms), but having atomics makes the whole thing easier to work with.
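Roughly what I tried back then (reconstructed from memory, names invented): one tiny user-space spin-lock per value instead of an atomic.

#include "tbb/spin_mutex.h"

// Sketch (not my original code): a float protected by its own spin-lock.
// In my measurements this performed about the same as an atomic float.
struct LockedFloat {
    tbb::spin_mutex mutex;
    float value;

    LockedFloat() : value(0.0f) {}

    void add(float x) {
        tbb::spin_mutex::scoped_lock lock(mutex);  // spins in user space, no kernel call
        value += x;
    }

    float get() {
        tbb::spin_mutex::scoped_lock lock(mutex);
        return value;
    }
};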

But in either case atomics don't get pipelined effectively (I think), so they perform about 4 times slower than a naked float.

This means depending on your situation they might be a good option, or you might need to rethink your atomicity/locking scheme.
RafSchietekat
Valued Contributor III
(Silly stuff removed.)

"However one curious bit of information, that came up at the time, is the protecting your floats with a user-space spin-lock results in practically the same performance. Using a spin-lock will probably make everything safer (if you are running on some edge-case platforms), but having atomic makes the whole thing easier to work with."
Locks take up room (we're talking about one lock per float, right?), they might cause convoying (when oversubscribing, so typically not in a proper TBB application), and you have to consistently apply them (unless you encapsulate them with the float into what amounts to an atomic), and not involve them in other things (or be careful not to cause a deadlock). If the operation is not already provided, using atomics requires writing a compare_and_store loop that seems strange if you're used to manipulating locked data, and you might perhaps be starved if competing against lots of simple manipulations. I don't see a clear winner, but it seems better to have a choice to fit the circumstances.
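To make the compare_and_store loop concrete, here is roughly what an addition looks like (sketch; assumes the floating-point support from the "Additions to atomic" patch):

#include "tbb/atomic.h"

// Sketch: adding to an atomic double via a compare-and-swap loop.
void atomic_add(tbb::atomic<double>& target, double value)
{
    double old_value, new_value;
    do {
        old_value = target;              // snapshot the current value
        new_value = old_value + value;   // compute the update locally
        // compare_and_swap writes new_value only if target still holds old_value,
        // and returns whatever it found there; otherwise we retry.
        // (A production version would compare bit patterns to also cope with NaNs.)
    } while (target.compare_and_swap(new_value, old_value) != old_value);
}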

"But in either case atomic don't get pipelined effectively (I think), so they perform about 4 times slower than a naked float."
You mean compared to a float by itself, without a lock, I hope? I haven't benchmarked it yet. Seems awful, doesn't it? But #2 says it's better than locked floats, so I'm not too concerned.

"This means depending on your situation they might be a good option, or you might need to rethink your atomicity/locking scheme."
If you can amortise the cost of locking somehow, atomic floats will surely lose. Or did you mean something else?

P.S.: Glad to hear you like it.

(Silly stuff removed.)
jimdempseyatthecove
Honored Contributor III

pkegel, Raf, Robert,

The problem with atomics, be they float, double, int, etc., is the programmer's tendency to overuse them in places where and when they are not required. This causes severe performance penalties (> 100x) in many cases.

Take pkegel's requirement to update a large dataset using parallel_for. In many cases, only the boundary cells require atomicity, while everything in between does not.

In other cases, such as particle interaction, it is much more efficient to partition the data and work on independent partition pairs in a non-interfering manner. This will require rewriting a 3-line loop as 300 lines (once), but those 300 lines will run > 100x faster than using all atomics.
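A toy sketch of the partitioning idea (1-D, names invented): each worker owns one chunk of the field, interior cells need no protection at all, and only the handful of cells shared with a neighbouring chunk go through a lock (or an atomic).

#include "tbb/spin_mutex.h"

// Sketch only: a worker updates its chunk [begin, end) of 'field' with a 3-point
// scatter. Cells written exclusively by this worker use plain doubles; the few
// cells that a neighbouring chunk can also write are protected by per-cell locks
// (one lock per cell is allocated here, but only the boundary ones are ever contended).
void update_chunk(double* field, tbb::spin_mutex* cell_locks, int begin, int end)
{
    for (int i = begin; i < end; ++i) {
        double contribution = 1.0;              // stand-in for the real computation
        for (int j = i - 1; j <= i + 1; ++j) {  // scatter into i-1, i, i+1
            if (j <= begin || j >= end - 1) {   // cell shared with a neighbouring chunk
                tbb::spin_mutex::scoped_lock lock(cell_locks[j]);
                field[j] += contribution;
            } else {
                field[j] += contribution;       // interior cell: no other worker writes it
            }
        }
    }
}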

The question is, do you invest some time (once) in programming effort, or do you waste time waiting for results ever after?

Jim Dempsey
RafSchietekat
Valued Contributor III

Which would be most appropriate: differentiate between double for interior cells and atomic for boundary cells in the user code, or have uniform arrays of atomic and differentiate at the operation level? If the latter, it would appear that there should be a difference between "relaxed" (atomic in the original sense of the word but no associated memory semantics relative to other operations) and "exposed" or so (indistinguishable from a plain double, and equally vulnerable to data races). For integer types, there is no clear advantage in distinguishing these, on most architectures, but floating-point data typically has to be shuttled around through an integer register to acquire atomic characteristics. The way I implemented atomics, you could have an array of exposed_atomic, where only the boundary cells would do things like store(3.14) to override the default "exposed". Just kicking the idea around: I don't have the user experience to validly choose one way or the other.

RafSchietekat
Valued Contributor III
Hmm, I could probably optimise things somewhat, at least for storing or loading a double on a 32-bit x86. Now a store() argument gets cast into a 64-bit integer, then it's copied into a floating-point register, and then it's stored. The last two steps are standard for 64-bit integers because they are presumably cheaper than "lock cmpxchg" where no ordered semantics are required, but with "double" data only the last step needs to be kept. Reverse the steps for the load situation, and some other operations may be similarly optimisable.