The discussion in http://software.intel.com/en-us/forums//topic/58670 raises the point that TBB's atomic operations are lacking:
(1) A way to express an unfenced atomic load
(2) A way to express a full fence (i.e. a fence with both acquire and release semantics)
(2) can be expressed pathetically as atomic
I'm wondering about opinions on syntax and generality for the additions suggested above. The simplest approach would be to add:
- T atomic<T>::unfenced_load() const
- void atomic<T>::fence() const
Another approach would be to add T atomic
Would the "simplest approach" be enough?
- Arch
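For readers who want to map the two proposed members onto later standard C++, here is a minimal sketch. It uses C++11 `std::atomic` rather than TBB's atomic, and the free functions `unfenced_load` and `full_fence` are hypothetical stand-ins for the proposed members, not TBB API.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the proposed unfenced_load() member:
// an atomic read that imposes no ordering on surrounding memory operations.
template <typename T>
T unfenced_load(const std::atomic<T>& a) {
    return a.load(std::memory_order_relaxed);
}

// Hypothetical stand-in for the proposed fence() member.
// Assumption: seq_cst is used as the strongest standalone fence;
// memory_order_acq_rel would be the literal reading of
// "both acquire and release semantics".
inline void full_fence() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```

Whether the fence should be a member at all is the open question in the post: the standard version above is not tied to any particular atomic object.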
policy->store,load,fetch_and_add
relaxed->relaxed,relaxed,relaxed
rel_acq->release,acquire,rel_acq
ordered->ordered,ordered,ordered
The atomic operations themselves are declared and documented in tbb_machine.h. "ordered" was used in earlier discussions, together with "raw". I kept "ordered" because it looks less awkward than "seq_cst", and dropped "raw" in favour of "relaxed" because "relaxed" has 7 characters like the others (duh) and because I didn't want to reject the C++0x names entirely, even though a programmer shouldn't be too relaxed about using "relaxed". I thought acq_rel was too confusing: first you release, then you do the basic operation, then you acquire, so it's best not to disturb that sequence in one's mind, hence rel_acq instead. Besides, most names in TBB already differ from what C++0x now proposes (which lacks the _and_ that TBB uses).
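The policy table above can be sketched with C++11 memory orders. This is only an illustration under assumptions: the enum `policy`, the helper functions and the wrapper `policy_atomic` are hypothetical names, and the mapping of rel_acq and ordered onto `std::memory_order` values is my reading of the table, not the contributed code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical policy tag mirroring the three rows of the table.
enum class policy { relaxed, rel_acq, ordered };

// policy -> ordering used for store
constexpr std::memory_order store_order(policy p) {
    return p == policy::relaxed ? std::memory_order_relaxed
         : p == policy::rel_acq ? std::memory_order_release
                                : std::memory_order_seq_cst;
}

// policy -> ordering used for load
constexpr std::memory_order load_order(policy p) {
    return p == policy::relaxed ? std::memory_order_relaxed
         : p == policy::rel_acq ? std::memory_order_acquire
                                : std::memory_order_seq_cst;
}

// policy -> ordering used for fetch_and_add (a read-modify-write)
constexpr std::memory_order rmw_order(policy p) {
    return p == policy::relaxed ? std::memory_order_relaxed
         : p == policy::rel_acq ? std::memory_order_acq_rel
                                : std::memory_order_seq_cst;
}

// Hypothetical wrapper that applies one policy to every operation.
template <typename T, policy P>
class policy_atomic {
    std::atomic<T> value_;
public:
    explicit policy_atomic(T v = T()) : value_(v) {}
    void store(T v) { value_.store(v, store_order(P)); }
    T load() const { return value_.load(load_order(P)); }
    T fetch_and_add(T d) { return value_.fetch_add(d, rmw_order(P)); }
};
```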
BTW, I'm now thinking about giving the tbb/machine headers more access to __TBB_fence_guard itself, because I'm still not happy about how it would be used on other architectures than linux_ia32.h. And right now I'm having some fleeting doubts whether template specialisation does more than using macros... but that's how it grew (removing the atomic_traits layer and going straight to a tier of simple templates); I'll give it some more thought, though.
(Added) In the spirit of full disclosure, here's another problem: currently default delegation keeps the same memory semantics, but that doesn't really work for linux_itanium.h, where, e.g., store
Making rel_acq the default instead of sequential consistency is quite dangerous, because realistically almost all machines (except Power) will implement rel_acq the same as sequential consistency. So unless you have a Power machine, there will be no way to truly test the code.
Thanks for your comment in our contribution about AtomicBackoff:
// TODO "Why is this often used in a compare-and-swap loop
// where the comparand is from the previous loop iteration?
// It would seem that this makes the new iteration more
// likely to fail because the underlying value has changed,
// thus costing extra iterations that are ever more
// likely to be interfered with.
I think what happened is that we timed benchmarks that had mostly very short waits, and there the reload seemed to slow things down. But after long waits, it's surely suboptimal to reuse the old value. I'll send your comment around to the group here to make sure everyone sees it.
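The point about stale comparands can be illustrated with a small compare-and-swap loop. This is a sketch using C++11 `std::atomic`, not TBB's AtomicBackoff: `fetch_and_multiply` is a hypothetical helper and `std::this_thread::yield()` merely stands in for a backoff pause.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: atomically multiply x by factor, return the old value.
inline int fetch_and_multiply(std::atomic<int>& x, int factor) {
    int expected = x.load(std::memory_order_relaxed);
    for (;;) {
        // On failure, compare_exchange_weak stores the value it observed
        // into `expected`, i.e. the value at the moment of the failure.
        if (x.compare_exchange_weak(expected, expected * factor)) {
            return expected;
        }
        std::this_thread::yield();  // stand-in for a backoff pause
        // After waiting, the observed value may be stale again, so reload
        // instead of reusing the comparand from the previous iteration.
        expected = x.load(std::memory_order_relaxed);
    }
}
```

For very short waits the extra load is pure overhead, which matches the benchmark observation in the post; the longer the pause, the more likely the reload saves a failed iteration.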
This assumes that the atomic is used without also using store
But I think such a decision should be based on a fair set of examples (of the very specific kind described in the second paragraph), not on a desire for fool-proofness that I think is misguided/short-sighted for reasons explained above. Disclaimer: I don't have the experience to know about such examples.
(Correction) Actually, I don't know which of SPARC's RMO/PSO/TSO modes would differentiate rel_acq vs. ordered, but quite likely none of them, so I withdraw the remark.
(Correction) Improved but not yet successful, so only for the curious.
(Added after next post) The file seems to have disappeared, so please use the file in the next post instead.
Apparently I had overestimated the power and reach of SFINAE (unless it's just a g++ limitation?), so I had to subspecialise some operations for ordered to use the default delegation anyway; an alternative might be to be more precise (and verbose) on the specialisations, thus excluding ordered in the first place.
I surrendered on the issue of not being able to mix partial specialisation and parameter defaults, so now I have atomic{,_relaxed,_rel_acq,_ordered}, with atomic behaving like atomic_rel_acq (anyone wanting to mix release, acquire and ordered might add atomic_mixed to the mix, but I'm still waiting for a good validating example); feel free to educate me on what I may have overlooked.
Still to do: more tests in test_atomic.cpp, finish other ports than linux_ia32.h and mac_ppc.h, but mainly I'm quite happy with this, so it should be mature enough for deeper scrutiny.
Mainly laid the foundations for SPARC support (see gcc_sparc.h), but I have to wait until Tuesday to validate and test it, so feel free to comment on it first (I welcome corrections and test results).
I've just had time to read
One problem I see is an elegance versus practical efficiency issue with the return type on operator&=, operator|=, and operator^=. So far, atomic
If there is no hardware support for fetch-and-{and,or,xor}, it might make more sense to generalize bitwise fetch-and-op to a template member fetch_and_op(binaryFunctor) that does an arbitrary fetch-and-op.
So far, the only client in TBB of bitwise atomic ops is spin_rw_mutex, and it definitely wants the efficient x86 implementations that do not return a value. I suppose we could continue to support those ops as __TBB_Atomic{AND,OR} macros.
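A generic fetch-and-op of the kind suggested above could look roughly like this. It is a sketch on top of C++11 `std::atomic`, written as a free function with an explicit operand; the name `fetch_and_op` comes from the post, but this signature is an assumption, not a TBB interface.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Applies op(old, operand) atomically via a compare-and-swap loop
// and returns the value held before the update.
template <typename T, typename BinaryOp>
T fetch_and_op(std::atomic<T>& target, T operand, BinaryOp op) {
    T old = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak refreshes `old` with the
    // current value, so the functor is re-applied to fresh data.
    while (!target.compare_exchange_weak(old, op(old, operand))) {
    }
    return old;
}
```

For example, `fetch_and_op(flags, mask, std::bit_and<unsigned>())` behaves like a fetch-and-and. On x86 this loop is what you pay when the old value is needed; the value-discarding `lock and` / `lock or` forms used by spin_rw_mutex avoid it.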
The __TBB_fence_guard is a clever trick. We'll have to inspect whether it has an impact on performance. Alas, not-so-clever compilers can generate a lot of gratuitous extra code for the sake of calling a destructor in the event of an exception, even if an exception is not possible. Only inspection of the binary code will tell us whether this is an issue. If code quality becomes an issue, we can break __TBB_fence_guard into two functions, one for entry and one for exit, which would be a slightly less elegant style but would still retain genericity.
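To make the two styles concrete, here is a rough sketch. It is not the actual __TBB_fence_guard, whose implementation is not shown in this thread; the names `fence_guard`, `fence_enter` and `fence_exit` are hypothetical, and the release-before/acquire-after placement is an assumption based on the "first you release, then you do the basic operation, then you acquire" description earlier in the thread.

```cpp
#include <atomic>
#include <cassert>

// Style 1: RAII guard. The destructor runs on every exit path,
// which is what may tempt a compiler into emitting unwinding code.
struct fence_guard {
    fence_guard()  { std::atomic_thread_fence(std::memory_order_release); }
    ~fence_guard() { std::atomic_thread_fence(std::memory_order_acquire); }
};

inline int guarded_add(std::atomic<int>& x, int d) {
    fence_guard g;  // release fence now, acquire fence at scope exit
    return x.fetch_add(d, std::memory_order_relaxed);
}

// Style 2: explicit entry and exit functions, no destructor involved.
inline void fence_enter() { std::atomic_thread_fence(std::memory_order_release); }
inline void fence_exit()  { std::atomic_thread_fence(std::memory_order_acquire); }

inline int split_add(std::atomic<int>& x, int d) {
    fence_enter();
    int old = x.fetch_add(d, std::memory_order_relaxed);
    fence_exit();
    return old;
}
```

Both functions perform the same sequence of fences; comparing the generated assembly for the two is the inspection the post asks for.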
Raf_Schietekat: Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (most recent stable release), tested for linux_ia32.h (except that I merged this with linux_em64t.h into gcc_x86_64_32.h) and mac_ppc.h.
Mainly laid the foundations for SPARC support (see gcc_sparc.h), but I have to wait until Tuesday to validate and test it, so feel free to comment on it first (I welcome corrections and test results).
Sorry for the stupid question, but how can I view this change? The latest development release is dated 'May 14, 2008', and I don't see cvs/svn access...
Dmitriy V'jukov
Adding add/subtract/and/or/xor to atomic would be easy: __TBB_Op is already in place to support atomic::store (the same pattern can be used to link it up for the others) and has been implemented for __TBB_op_{store,add,and,or,xor} where possible to support __TBB_Atomic{AND,OR} as well as store and also because I'm a pedantic orthogonalist, so I'm definitely not against the idea.
I don't see a performance benefit for operations that don't need the existing value, only a register allocation benefit: the x86, as a CISC instruction set, is tailored to this "linguistically", and it has few (visible) registers, but internally it will still have to fetch the old value first, so in a way it is ironic that it only has instructions that discard that old value. Which (other) processors have I overlooked that do fetch-and-add better than fetch-and-{and,or,xor}? Well, there's also the CISC vs. RISC thing (instruction count), but that's a separate issue, I think; and I don't really know why more processors don't allow locking the bus (seems rather greedy, though) or an address (using cache coherence hardware), but maybe there would be insufficient benefit over looping to consider that.
Generalizing is exactly what goes on underneath (in __TBB_(Fetch)Op), but it looks a bit heavy syntax-wise to expose that to the user. I'm not sure whether doing the same for a general user-supplied function would be beneficial, because there is some syntactic overhead to a functor that may offset the benefit of not having to write a simple compare-and-swap loop, and how often would it be used anyway (and if wanted, it can always be added as an external template operation). I'm only hesitating about whether POWER/PowerPC can do anything with reservations to leverage an intrusive design.
The question seems to be whether atomic should be the basis for locks or not. If they should be, there's a design problem with platforms that don't have self-contained atomic support, although I can't immediately think of any that are not nearing end of life (PA-RISC). Otherwise it seems fine to continue to rely on a layer that is not exposed in atomic, and the (non-rhetorical) question becomes what other purposes non-fetch add/subtract/and/or/xor would serve.
I do worry a bit about performance, but for the moment only about non-optimised builds that do not optimise away calls to inline functions. GNU has __attribute__((always_inline)) or somesuch to come to the rescue if needed; I don't know about the others. But I have never considered compilers, if any, that would generate code where nothing is needed for stack unwinding: surely after inlining they would come to their senses?
A question relating to locking the bus or not: to implement various operations, RISC processors will typically loop over compare-and-swap without something like TBB's AtomicBackoff, or that is what manuals suggest anyway. Until now I took it on faith that AtomicBackoff is like what Ethernet does for collision avoidance, but how certain are you that it is beneficial for short unblockable operations? Maybe the maximum pause should be limited or even be variable instead of ever-increasing (with its clandestine biblical bias: the first will be the last and the last shall be the first)? And one step further (maybe one too far?), would be the question whether the reportedly unfavorable test results for a skip list map included such a backoff on the atomics or not?
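The idea of limiting the maximum pause can be sketched as follows. This is an illustration only, not TBB's AtomicBackoff: `capped_backoff` and `fetch_and_increment` are hypothetical names, the default cap of 16 is an arbitrary assumption, and `std::this_thread::yield()` stands in for whatever pause instruction a real port would use.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>

// Exponential backoff whose pause length stops growing at a fixed cap.
class capped_backoff {
    int count_;
    int limit_;
public:
    explicit capped_backoff(int limit = 16) : count_(1), limit_(limit) {}
    void pause() {
        for (int i = 0; i < count_; ++i) std::this_thread::yield();
        count_ = std::min(count_ * 2, limit_);  // double, but never past the cap
    }
    int current() const { return count_; }
};

// Compare-and-swap loop that backs off on contention and reloads
// the comparand after each pause.
inline int fetch_and_increment(std::atomic<int>& x) {
    capped_backoff backoff;
    for (;;) {
        int old = x.load(std::memory_order_relaxed);
        if (x.compare_exchange_weak(old, old + 1)) return old;
        backoff.pause();
    }
}
```

With a cap, a thread that has lost many rounds does not wait ever longer than the newcomers, which addresses the "first will be last" bias mentioned above; whether that helps throughput for short unblockable operations would still have to be measured.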
cp $(FROM)/gcc_power.h $(TO)/include/tbb/machine
cp $(FROM)/gcc_sparc.h $(TO)/include/tbb/machine
cp $(FROM)/gcc_x86_64_32.h $(TO)/include/tbb/machine
cp $(FROM)/ibm_aix51.h $(TO)/include/tbb/machine
cp $(FROM)/linux_common.h $(TO)/include/tbb/machine
cp $(FROM)/linux_em64t.h $(TO)/include/tbb/machine
cp $(FROM)/linux_ia32.h $(TO)/include/tbb/machine
cp $(FROM)/linux_itanium.h $(TO)/include/tbb/machine
cp $(FROM)/mac_ppc.h $(TO)/include/tbb/machine
cp $(FROM)/windows_em64t.h $(TO)/include/tbb/machine
cp $(FROM)/windows_ia32.h $(TO)/include/tbb/machine
cp $(FROM)/windows_ia32_inline.h $(TO)/include/tbb/machine
cp $(FROM)/atomic.h $(TO)/include/tbb
cp $(FROM)/tbb_machine.h $(TO)/include/tbb
cp $(FROM)/atomic_support.c $(TO)/src/tbb/ibm_aix51
cp $(FROM)/queuing_rw_mutex.cpp $(TO)/src/tbb
cp $(FROM)/tbb_misc.h $(TO)/src/tbb
cp $(FROM)/test_atomic.cpp $(TO)/src/test
cp $(FROM)/test_compiler.cpp $(TO)/src/test
cp $(FROM)/test_rwm_upgrade_downgrade.cpp $(TO)/src/test
Actually that's from my current list after I changed gcc_x86.h back to the rather silly gcc_x86_64_32.h still used in the zip file; I hope I didn't forget anything else.
Raf_Schietekat: I'm sure I'm misinterpreting Robert's last post, but just in case: you only need Downloads>"Stable Release">tbb20_20080408oss>tbb20_20080408oss_src.tar.gz and "TBB Discussion Forum">"Additions to atomic">{atomic.Raf_Schietekat.20080529.zip from "05-29-2008, 12:08 PM" and patch script from item 32 which is the second post on the second page that does not have an absolute timestamp yet}.
Oh... now I see the attachment to the post :)
Ok, thank you. I will try to combine the release and the patch.
Btw, concerning fetch-and-{and,or,xor}.
In my implementation of atomics I provide 2 versions of each function:
value_t fetch_add(value_t v);
void add(value_t v);
value_t fetch_and(value_t v);
void and(value_t v);
and so on...
I'm still not sure whether it makes sense...
Dmitriy V'jukov
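The "two versions of each function" idea can be sketched on top of C++11 `std::atomic`. The class name `dual_atomic` is hypothetical, and the void bitwise member is called `and_with` here because `and` is a reserved alternative token in C++ and cannot be used as a member name.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper offering a value-returning and a void variant
// of each read-modify-write operation.
template <typename T>
class dual_atomic {
    std::atomic<T> value_;
public:
    explicit dual_atomic(T v) : value_(v) {}

    T    fetch_add(T v) { return value_.fetch_add(v); }  // returns old value
    void add(T v)       { value_.fetch_add(v); }         // result discarded

    T    fetch_and(T v) { return value_.fetch_and(v); }  // returns old value
    void and_with(T v)  { value_.fetch_and(v); }         // result discarded

    T load() const { return value_.load(); }
};
```

The void variants make the intent explicit at the call site: when the old value is not needed, an implementation is free to use an instruction form that does not return it.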
Sorry for the confusion, Raf. I'd seen that you had posted at least one of your atomic
Better SPARC support (still to be tested), first draft for Alpha support.
Locked atomics (currently only used for 64 bits on 32-bit PowerPC), hopefully more complete support for Itanium, successful test for SPARC (well, at some point during the making of this version anyway), various things.
Raf_Schietekat:
Locked atomics (currently only used for 64 bits on 32-bit PowerPC)
And what about early x86-64 systems, which don't support the double-word cmpxchg instruction? ;)
Dmitriy V'jukov
Anyone, feel free to test anything other than g++ on 32-bit x86 and 32-bit PowerPC (those I can easily do myself, so they'll only need final validation later on, except that for the latter I could only build part of TBB). I did a test of SPARC at some point, but for the others I can only hope that I made no mistakes (it's not impossible, but...). In particular, Itanium needs some attention (totally revised), but maybe you want to use an Alpha machine a little while longer (not yet in TBB); I don't know how much interest there would be in PA-RISC (which would require a revision of some TBB assumptions because unlocked semaphores are non-zero), or in MIPS (I could not easily find documentation for it)? I have merged with tbb21_20080605oss by now, if anyone needs that instead.
P.S.: Hey, I reached a century (or how is that in cricket speak?)!
Not really interesting, just an upgrade to a new TBB stable release version, and 64-bit atomics didn't actually work yet on 32-bit PowerPC in the previous version (I was away from my Mac mini and made a few too many assumptions). The one interesting thing is that I've found that... it doesn't work for a release build on x86 with g++: test_atomic.exe fails for 64-bit atomics, and I don't know why. It does work for a debug build, and for both builds on Mac OS X, but it fails for release on Ubuntu 6.06 for real or on 8.04 in Sun xVM VirtualBox. Of course I'll keep trying to find out why (an extra problem is that sometimes the problem disappears when I add diagnostics, although it seems to be perfectly reproducible between runs if the code is the same), but if you'd like to, e.g., compile to assembler and investigate that to tell me what I missed (the relevant section should be fairly short), please go right ahead.
Oh yes, the revised Itanium code still has never been tested, and any other useful comments are welcome as well.
Raf_Schietekat: Dmitriy, feel free to contribute what I need to know (I suppose you mean quadword, because doubleword seems to be 32 bits on IA-32/Intel 64). I already merged the 32-bit and 64-bit variants of the x86 architecture (as described in Intel's documentation as IA-32/Intel 64), so it should be relatively easy to fit in an intermediate version. I just need to know the relevant predefined macros and what subset is available (everything except 64-bit cmpxchg?), I suppose (unless it's not a subset, in which case I need some more documentation).
Sorry for acting like a$$ :)
I mean that some early x86-64 chips lack the cmpxchg16b instruction (128-bit, i.e. a double-word CAS in terms of a 64-bit platform). And you can't detect cmpxchg16b support statically; you can detect it only at run time, by analysing the values returned by the CPUID (eax=1) instruction. So you can't just define some macros...
Dmitriy V'jukov
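The run-time check described above can be sketched as follows, assuming GCC or Clang on x86, where the `<cpuid.h>` header provides `__get_cpuid`. The function name `has_cmpxchg16b` is hypothetical; the feature flag itself is bit 13 of ECX for CPUID leaf 1. On other compilers or architectures this sketch simply reports false.

```cpp
#include <cassert>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// Returns true if the running CPU reports CMPXCHG16B support.
inline bool has_cmpxchg16b() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 if the requested leaf is not supported.
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return (ecx & (1u << 13)) != 0;  // CPUID.01H:ECX bit 13 = CMPXCHG16B
#else
    return false;  // assumption: treat unknown platforms as unsupported
#endif
}
```

A library would typically call this once at startup and select either a cmpxchg16b-based path or a locked fallback through a function pointer, since the choice cannot be made with predefined macros.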