Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Additions to atomic<T>

ARCH_R_Intel
Employee
4,898 Views

The discussion in http://software.intel.com/en-us/forums//topic/58670 raises the point that TBB's atomic operations lack:

  1. A way to express an unfenced atomic load
  2. A way to express a full fence (i.e. fence with both acquire and release semantics)

(2) can be expressed pathetically as atomic::fetch_and_add(0).

I'm wondering about opinions on syntax and generality for the additions suggested above. The simplest approach would be to add:

  1. T atomic::unfenced_load() const
  2. void atomic::fence() const

Another approach would be to add T atomic::load<M>() const, and allow M to specify acquire, release, full fence, or unfenced. But my opinion is that that drags in much pointless complexity, because it adds "full_fence" and "unfenced" to the possibilities for M, which is a template parameter for other atomic ops too, and the "release" variant for a load seems of dubious value.

Would the "simplest approach" be enough?

- Arch

0 Kudos
154 Replies
RafSchietekat
Valued Contributor III
635 Views
That might indeed be relevant to any support for generic atomic objects (absent built-in 128-bit integral types), but the decision would (also?) need to be made at development time: if the instruction is not available, the atomic's implementation would have to be based on a lock, doubling the size, although the lock might still be disregarded at runtime for better performance even if the memory is already wasted. It's too late in the day for me now to go check the alignment issues, but I do have an uneasy feeling about them, so it might still all be irrelevant after all.
RafSchietekat
Valued Contributor III
635 Views
Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release), tested for x86/Ubuntu 6.06/g++ 4.0.3 (only), see README for applying the patch.

Found a simple but strange remedy for the problem with 64-bit atomics on x86, removed AtomicBackoff altogether (was that a good idea?), some new tests, tracked TBB examples/Makefile change for SPARC.

(Added) It would seem that nobody is interested, for why else would there have been not a single reaction to the removal of AtomicBackoff (which I am hereby rolling back)?

RafSchietekat
Valued Contributor III
635 Views
Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README for applying the patch.

Improved spin lock (see "Spinning"), added support for MIPS and some promising code for PA-RISC/HP-UX/g++ (some tests were not successful).

(Added) Please note that the PA-RISC code differs from the time the tests were run and will not build. An improved version is in the works.
landmann
Beginner
635 Views
I am quite a beginner with TBB and was wondering where the mentioned atomic is actually used. Is it included in tbb21_009oss? I am doing some tests on SMP ARM to compare it with my own scheduler.
I started to port using the PowerPC header, but it only implements cmpxchg atomics with full-barrier semantics. As these sync calls may become visible on ARM, I had a look at the IA64 implementation but found no template either, only fully referenced externals...quite a lot of them (too many for a quick test, so I will stay with the ppc approach).
Has someone thought about using libatomic_ops? It already covers a lot of architectures and provides semantics, but is plain-C only.

Regards!

--L
RafSchietekat
Valued Contributor III
635 Views
If you would bear with me for another day or so, I will include support for ARM (after some final touches for 64-bit atomics) in a new version that will correct buildability on PA-RISC.
landmann
Beginner
635 Views
Sounds good. Sorry for not seeing at first that the atomics work is a patch to TBB. I will use the simplest variant (just a full-barrier cas) until then, and if more advanced atomics become available via a patch, I already have something for benchmarking and comparing both variants.
However, you wrote that your atomics differ from the proposal for C++0x. I don't want to discuss this here, but in case the proposal by Boehm/Crowl makes it into C++, is there a _technical_ issue which prevents it from being used in TBB?
If you are interested, feel free to start some off-list discussions (also ARM-specific if needed).

Regards!

--L
RafSchietekat
Valued Contributor III
635 Views
I'm going to mull over PA-RISC some more, but I did promise something for ARM, so here it is (on top of the 2008-08-03 patch; not yet officially contributed, so no peeking from TBB staff :-); no changes since the previous post, so no 64 bits yet). If you are familiar enough with the issues, please have a look at the fence-related stuff in gcc_arm.h, and let me know about any problems (you may need to officially contribute any nontrivial suggestions).

landmann
Beginner
635 Views
Wow, that looks astonishing. I need some time to go through it.
I really like the scoped fence construction!

One issue already exists: the atomic store operations should not be done with a simple store, but with ldrex/strex, to clear the exclusive monitor. At least as long as the Linux kernel does not guarantee to always clear the monitor on a context switch. There have been some discussions on the lists, but honestly I do not know the current state of mainline.
You mentioned there is no 64-bit support yet. Does TBB run without it, or is it simulated with locks?
I don't understand the sign-extension problem you raised...as long as the compare instruction is used with the proper 'width', it should not matter whether the upper parts of the registers are sign- or zero-extended, should it?
What about using gcc's built-ins for log2? It would need only one declaration for all the gcc-based includes.

Thank you!
--L
Alexey-Kukanov
Employee
635 Views

Landmann:
What about using gcc's built-ins for log2? It would need only one declaration for all the gcc-based includes.

In general, we avoid using compiler built-ins in TBB, because the code will be built by different versions of different compilers (gcc 3.2.x to 4.3.x, and Intel C++ Compiler 9.0 to 10.1, to name a few). That's why inlined assembly is preferred. For a particular platform where a zoo of different compilers is not a problem, using intrinsics might be totally acceptable.

RafSchietekat
Valued Contributor III
635 Views
Store and load are as they should be (you don't need an exclusive store to clear the monitor). Thanks for drawing my attention to the fact that ARM dropped the ball on clearing the monitor for abrupt control-flow changes (how about a recall?); it then falls to the code taking control (in this case the Linux kernel) to correct that, because there's nothing a user can do to absolutely guarantee correctness, is there, except perhaps do silly things to improve the odds at the expense of performance? 64-bit support can be done without a lock (I'll work on it over the weekend); if you want to try this now, reverse the strex result flag test (I had not noticed yet that it flags failure instead of success) and disable the 64-bit test in test_atomic (TBB only needs 64-bit atomics if a pointer is that size). Sign-extension issues arise because "cmp" has no width argument; it always compares entire registers. Reliance on g++ intrinsics is not a net gain for TBB.

(Added) Please use new gcc_arm.h snapshot for testing (also disables 64-bit test in test_atomic), on top of previous diff.
landmann
Beginner
635 Views
Thank you, that makes sense if "old" compilers have to be supported.

Besides: I managed to run TBB on ARM using only the cmpxchg implementations. However, release builds tend to freeze after a couple of seconds; debug builds just take some more time. It might be related to the issue I mentioned previously. Am I right that standard TBB does not provide a way of specifying operations for atomic stores for all sizes (1, 2, 4, 8)?

I am out to check Raf's patch now, but it needs some more work.
For instance, why do I have to specify both the type and the size of atomic operations separately?:
__TBB_M(1,int8_t ,"b")
__TBB_M(2,int16_t,"h")
__TBB_M(4,int32_t,"" )
The first argument should always be sizeof() the second here, shouldn't it?

--L


landmann
Beginner
635 Views
Regarding cmp, you are totally right. I was misled by your code, actually, because it allowed size modifiers to be appended to the cmp mnemonic. Now I suppose that was just to break compilation and catch problematic cases.

Thanks for providing an update, I will see how far I get.

Regards!
Alexey-Kukanov
Employee
635 Views

Yes, you are right; in TBB 2.0 and 2.1 the stores are assumed to be atomic (except for the 64-bit store on IA-32, which is implemented explicitly), while fences are missing.

Both type and size are specified because (at least in vanilla TBB) the size digit becomes part of a function name via ## in the macro.

landmann
Beginner
635 Views
Raf, here comes an updated version for ARM.
- it compiles without the assembler complaining about disallowed registers, etc.
- it allows TBB to run my example without freezing
- no 64-bit support included, but I consider it useful (ABA-prevention things)

So in case you need it, I vote for using your approach in TBB (at least on the ARM side).

--L
RafSchietekat
Valued Contributor III
635 Views
Thanks, and sorry for the bad proofreading (good catch on the test for exclusive-store failure!); I'll try to do better for the 64-bit code still to be added. I did restore the double duty of "intended", however, and rolled back the "memory" clobber (that's what "m"(*(T*)ptr) is for, because clobbering "memory" may prevent some valid optimisations, theoretically at least).

landmann
Beginner
635 Views
As I do not know enough about the internals of gas+gcc, I was afraid that an unused/unreferenced output parameter ("m"(*(T*)ptr)) might be ignored. Due to the bad experience with the 'stock' version, I used the memory clobber for safety. But I totally agree that it lacks the ability to specify a distinct address, so if the distinct output really works, it is clearly the better solution.
Now going to play with Realview Compiler and TBB...

--L
RafSchietekat
Valued Contributor III
635 Views
Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README.atomic for applying the patch.

Some traction on PA-RISC/HP-UX/g++ (most of the tests and examples should work), added support for ARM/g++ (64-bit support ready for testing) and {ESA/390,z/Architecture}/g++, parallel_for examples now show up on PowerPC-based Mac OS X (x86-specific flags now only selectively added).

Question: why does (only) the polygon overlay example remain black on x86/Ubuntu?

RafSchietekat
Valued Contributor III
635 Views
Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README.atomic for applying the patch.

Scalable memory allocator now works on PA-RISC/HP-UX/aCC.

(Added) Please note that I have abandoned my goal of porting TBB itself to aCC, a poster child for open source.

RafSchietekat
Valued Contributor III
615 Views
Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README.atomic for applying the patch.

Some corrections and ideas for the scalable memory allocator, and tracking some changes by Intel's TBB team.

Corrections include: alignment guarantees, portability.

The ideas are about decreasing memory overhead (adding bins for sizes 2 and 4, eliminating most of the overhead for allocation sizes of about 8000 to 16000 bytes, potentially decreasing overhead because of the alignment needs for large objects, or eliminating those needs altogether). Because it is not (always) obvious how this will affect performance and whether it is even effective, preprocessor switches control which changes are active, and you are invited to evaluate the results (see __TBB_USE_... in src/tbbmalloc/MemoryAllocator.cpp). The __TBB_USE_MAGIC option is incomplete: first the safe probing should be validated for this to be a viable approach.

RafSchietekat
Valued Contributor III
615 Views
Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README.atomic for applying the patch.

Some new atomic functionality (compare-and-swap with spurious failure) but with a twist (different signature, pedantic orthogonalism).

Spurious failure isn't as bad as the name would indicate: it just means that the function does not have to keep trying (spinning) until it can come up with a better excuse than that the other threads got in the way, i.e., the observed value does not have to differ from the expected value. On some architectures this does not matter (x86 gets/simulates a lock and either succeeds or fails because the observed value is actually different); on others it does, and where it matters one might as well unravel the compound loop of spinning on a spinning compare_and_swap, which should also allow for a more rational backoff policy (or so I hope).

The twist is to use a signature

bool compare_and_store(value_type &comparand, value_type intended, bool wait = true);

where the observed value is exchanged for the expected value in "comparand", instead of being the return value (TBB style) or being exchanged for the intended value, as is often done (it seems more user-friendly to me to keep "comparand" continuously updated, with x86 and ESA/390-z/Architecture as precedents). The "wait" parameter determines whether spurious failure is disallowed (the default mimics compare_and_swap). This is part of an orthogonal {compare,fetch}_and_{store,add,and,or,xor}() family of operations (just for kicks, I also added {add,and,or,xor}_and_fetch() with "prefix" functionality; the one with store would probably not make sense even though it does the same as operator=). compare_and_swap is now somewhat superfluous but not yet deprecated (pending evaluation, but some uses have already been replaced with a specific expectation to improve performance).

Here's a small illustration in pseudo code, ignoring backoff, memory semantics, etc., showing a default implementation for one of the other operations:

int fetch_and_add(int addend) {
    int comparand = load(); // expected/observed, initial value not critical for correctness
    while( !compare_and_store( comparand, comparand+addend, irrelevant ) );
    return comparand;
}

As a mental exercise, try to rewrite this with the observed value either returned or replacing the intended value.

So, what do you think?

Dmitry_Vyukov
Valued Contributor I
615 Views
I think it's great!
IIRC, a CAS that updates the comparand is called something like 'IBM-style'.

What I am missing in the current C++0x draft is functions like atomic_{add,and,or,xor}, without fetch at all.
On x86, the 'lock' prefix can be applied to instructions like add, and, or, xor, and in some algorithms this is exactly what is needed: I don't need the previous value, I don't need a CAS loop, I just need the atomic operation.

Btw, if you want to make the interfaces more C++-ish and have more guarantees about static dispatch, you can consider the following interfaces:
// how this is made in current C++0x (or Java):
bool compare_and_store_weak(value_type &comparand, value_type intended);
bool compare_and_store_strong(value_type &comparand, value_type intended);
or:
// or just static dispatch based on overloading:
bool compare_and_store(value_type &comparand, value_type intended); // strong version
bool compare_and_store(value_type &comparand, value_type intended, bool /*fake*/); // weak version
