Re: Additions to atomic<T> - Page 2

ARCH_R_Intel · ‎05-09-2008

The discussion in http://software.intel.com/en-us/forums//topic/58670raises the point that TBB's atomic operations are lacking:

A way to express an unfenced atomic load
A way to express a full fence (i.e. fence with both acquire and release semantics)

(2) can be expressed pathetically as atomic::fetch_and_add(0).

I'm wondering about opinions on syntax and generality for the additions suggested above. The simplest approach would be to add:

T atomic::unfenced_load() const
void atomic::fence() const

Another approach would be to add T atomic::load() const, and allow M to specify acquire, release, full fence, or unfenced. But my opinion is that drags inmuch pointless complexity, because it adds "full_fence" and "unfenced" to the possibilities for M that is a template parameter for other atomic ops too, and the "release" variant for a load seems of dubious value.

Would the "simplest approach" be enough?

- Arch

RafSchietekat · ‎05-22-2008

Sorry, these comments are about an unrealised idea for tbb::atomic: a policy for setting default memory semantics (with its own default being rel_acq). I couldn't get past the incompatibility of default template arguments and partial specialisation, on g++ 4.0.3 at least, so I'm still thinking how that could be implemented. Here's how it would apply to some atomic operations (the policy names are recycled from memory_semantics to coincide with how they would affect something like fetch_and_add):

policy->store,load,fetch_and_add
relaxed->relaxed,relaxed,relaxed
rel_acq->release,acquire,rel_acq
ordered->ordered,ordered,ordered

The atomic operations themselves are declared and documented in tbb_machine.h. "ordered" was used in earlier discussions, together with "raw", and I picked one of these because it looks less awkward than "seq_cst" while dumping "raw" for "relaxed" because "relaxed" has 7 characters like the others (duh) and I didn't want to totally reject the C++0x names, even though a programmer shouldn't be too relaxed about using "relaxed". I thought acq_rel is too confusing: first you release, then you do the basic operation, then you acquire, so it's best not to disturb that sequence in one's mind, by having rel_acq instead. Also, most names in TBB are different than what C++0x now proposes anyway (not having the _and_ that TBB uses).

BTW, I'm now thinking about giving the tbb/machine headers more access to __TBB_fence_guard itself, because I'm still not happy about how it would be used on other architectures than linux_ia32.h. And right now I'm having some fleeting doubts whether template specialisation does more than using macros... but that's how it grew (removing the atomic_traits layer and going straight to a tier of simple templates); I'll give it some more thought, though.

(Added) In the spirit of full disclosure, here's another problem: currently default delegation keeps the same memory semantics, but that doesn't really work for linux_itanium.h, where, e.g., store could go to store instead of to a loop of compare_and_swap by way of fetch_and_store (conceptually). The reason is that the __TBB_fence_guard policy normally takes care of everything below ordered, and, having no access to tools/generate_atomic/ipf_generate.sh, I'm assuming that it might be easier to just make it conform to this pattern (relaxed being new and all) than to break the pattern.

ARCH_R_Intel · ‎05-22-2008

Making rel_acq the default instead of sequential consistency is quite dangerous, because realistically almost all machines (except Power) will implement rel_acq the same as sequential consistency.Sounless you have a Power machine, there will be no way to truly test the code.

Thanks for your comment in our contribution about AtomicBackoff:

// TODO "Why is this often used in a compare-and-swap loop
// where the comparand is from the previous loop iteration?
// It would seem that this makes the new iteration more
// likely to fail because the underlying value has changed,
// thus costing extra iterations that are ever more
// likely to be interfered with.

I think what happened is that we timed benchmarks that had mostly very short waits, and there the reload seemed to slow things down. But after long waits, it's surely suboptimal to reuse the old value. I'll sned your comment around to the group here to make sure everyone sees it.

RafSchietekat · ‎05-22-2008

"Making rel_acq the default instead of sequential consistency is quite dangerous, because realistically almost all machines (except Power) will implement rel_acq the same as sequential consistency. So unless you have a Power machine, there will be no way to truly test the code."

This assumes that the atomic is used without also using store and load, or where that use is not a problem while the operations using rel_acq (fetch_and_store etc.) are used with a fatally mistaken assumption of sequential consistency: otherwise the problems with store and load would draw the attention to the problem. And sequential consistency is very expensive as a default... but I'm preaching to the choir now, because that's what I always heard from the Intel (TBB) side, I think (C++0x is not talking about a policy release,acquire,ordered: it's seq_cst pervasively). Taking it as the default makes it even more urgent to allow a different defaults policy to be plugged in, to avoid having to sprinkle explicit semantics around the code to get performance back, which can only lead to more bugs, not fewer.

Once there's a SPARC port (Solaris now works only with IA-32/Intel 64), it would also be potentially affected if RMO (Relaxed Memory Order) is configured, I guess (although I don't know what would be normal practice).

But I think such a decision should be based on a fair set of examples (of the very specific kind described in the second paragraph), not on a desire for fool-proofness that I think is misguided/short-sighted for reasons explained above. Disclaimer: I don't have the experience to know about such examples.

(Correction) Actually, I don't know which of SPARC's RMO/PSO/TSO modes would differentiate rel_acq vs. ordered, but quite likely none of them, so I withdraw the remark.

RafSchietekat · ‎05-24-2008

Improved atomics proposal (also officially contributed), tested for linux_ia32.h and now also for mac_ppc.h, still based on tbb20_20080408oss_src (current stable release). It settles the __TBB_fence(_guard) problem I still had, and applies the changes to most of the supported platforms (except for verification). Some things still to do: unit tests for bitwise operations etc. (if anyone would like to do that?), configurable memory semantics defaults policy (I don't know the path to a solution yet).

(Correction) Improved but not yet successful, so only for the curious.

(Added after next post) File seems to have disappeared, but please use file in next post instead.

RafSchietekat · ‎05-26-2008

Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (most recent stable release), tested for linux_ia32.h and mac_ppc.h.

Apparently I had overestimated the power and reach of SFINAE (unless it's just a g++ limitation?), so I had to subspecialise some operations for ordered to use the default delegation anyway; an alternative might be to be more precise (and verbose) on the specialisations, thus excluding ordered in the first place.

I surrendered on the issue of not being able to mix partial specialisation and parameter defaults, so now I have atomic{,_relaxed,_rel_acq,_ordered}, with atomic behaving like atomic_rel_acq (anyone wanting to mix release, acquire and ordered might add atomic_mixed to the mix, but I'm still waiting for a good validating example); feel free to educate me on what I may have overlooked.

Still to do: more tests in test_atomic.cpp, finish other ports than linux_ia32.h and mac_ppc.h, but mainly I'm quite happy with this, so it should be mature enough for deeper scrutiny.

RafSchietekat · ‎05-29-2008

Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (most recent stable release), tested for linux_ia32.h (except that I merged this with linux_emt64t.h into gcc_x86_64_32.h) and mac_ppc.h.

Mainly laid the foundations for SPARC support (see gcc_sparc.h), but I have to wait until Tuesday to validate and test it, so feel free to comment on it first (I welcome corrections and test results).

ARCH_R_Intel · ‎05-29-2008

I've just had time to read so far. I like the inheritance hierarchy in the implementation classes.

One problem I see is an elegance versus practical efficiency issue with the return type on operator&=, operator|=, and operator^=. So far, atomic has been designed to support only atomic operations that have efficiently hardware support on typical platforms. E.g., atomic::operator+= exists because most modern hardware has fetch-and-add support. (Perversely, my own employer's Itanium has perhaps the weakest support for it because of the way it restricts the addend to certain constants.) But I'm not familiar with any widely used hardware that has fetch-and-{and,or,xor} support. However, x86 does have efficient "lock {and,or,xor}" instructions for the situation where an atomic bitwise operation is desired without knowning the prior value. Alas C++ does not allow overload resolution to choose based on whether the return value is used or not.

If there is no hardware support for fetch-and-{and,or,xor}, it might make more sense to generalize bitwise fetch-and-op to a template memberfetch_and_op(binaryFunctor) that does an arbitrary fetch-and-op.

So far, the only client in TBB of bitwise atomic ops is spin_rw_mutex, and it definitely wants the efficient x86 implementations that do not return a value. I suppose we could continue to support those ops as __TBB_Atomic{AND,OR} macros.

ARCH_R_Intel · ‎05-29-2008

The __TBB_fence_guard is a clever trick. We'll have to inspect whether it has an impact on performance. Alas not-so-clever compilers can generate a lot of gratuitous extra code for sake of calling a destructor in event of an exception, even if an exception is not possible. Only inspection of the binary code will tell us if this is an issue or not. If code quality becomes an issue, we can break __TBB_fence_guard into two functions, one for entry and one for exit, which would be a slighly less elegant style but still retain genericity.

Dmitry_Vyukov · ‎05-29-2008

Raf_Schietekat:
Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (most recent stable release), tested for linux_ia32.h (except that I merged this with linux_emt64t.h into gcc_x86_64_32.h) and mac_ppc.h.
Mainly laid the foundations for SPARC support (see gcc_sparc.h), but I have to wait until Tuesday to validate and test it, so feel free to comment on it first (I welcome corrections and test results).

Sorry for stupid question, but how I can view this change? Latest development release dated by 'May 14, 2008'. And I don't see cvs/svn access...

Dmitriy V'jukov

RafSchietekat · ‎05-30-2008

Regarding Arch's last two posts:

Adding add/subtract/and/or/xor to atomic would be easy: __TBB_Op is already in place to support atomic::store (the same pattern can be used to link it up for the others) and has been implemented for __TBB_op_{store,add,and,or,xor} where possible to support __TBB_Atomic{AND,OR} as well as store and also because I'm a pedantic orthogonalist, so I'm definitely not against the idea.

I don't see a performance benefit for operations that don't need the existing value, only a register allocation benefit: the x86, as a CISC instruction set, is tailored to this "linguistically", and it has few (visible) registers, but internally it will still have to fetch the old value first, so in a way it is ironical that it only has instructions that discard that old value. Which (other) processors have I overlooked that do fetch-and-add better than fetch-and-{and,or,xor}? Well, there's also the CISC vs. RISC thing (instruction count), but that's a separate issue, I think; and I don't really know why not more processors allow locking the bus (seems rather greedy, though) or an address (using cache coherence hardware), but maybe there would be insufficient benefit over looping to consider that.

Generalizing is exactly what goes on underneath (in __TBB_(Fetch)Op), but it looks a bit heavy syntax-wise to expose that to the user. I'm not sure whether doing the same for a general user-supplied function would be beneficial, because there is some syntactic overhead to a functor that may offset the benefit of not having to write a simple compare-and-swap loop, and how often would it be used anyway (and if wanted, it can always be added as an external template operation). I'm only hesitating about whether POWER/PowerPC can do anything with reservations to leverage an intrusive design.

The question seems to be whether atomic should be the basis for locks or not. If they should be, there's a design problem with platforms that don't have self-contained atomic support, although I can't immediately think of any that are not nearing end of life (PA-RISC). Otherwise it seems fine to continue to rely on a layer that is not exposed in atomic, and the (non-rhetorical) question becomes what other purposes non-fetch add/subtract/and/or/xor would serve.

I do worry a bit about performance, but for the moment only about non-optimised builds that do not optimise away calls to inline functions. GNU has __attribute__((always_inline)) or somesuch to come to the rescue if needed; I don't know about the others. But I have never considered compilers, if any, that would generate code where nothing is needed for stack unwinding: surely after inlining they would come to their senses?

A question relating to locking the bus or not: to implement various operations, RISC processors will typically loop over compare-and-swap without something like TBB's AtomicBackoff, or that is what manuals suggest anyway. Until now I took it on faith that AtomicBackoff is like what Ethernet does for collision avoidance, but how certain are you that it is beneficial for short unblockable operations? Maybe the maximum pause should be limited or even be variable instead of ever-increasing (with its clandestine biblical bias: the first will be the last and the last shall be the first)? And one step further (maybe one too far?), would be the question whether the reportedly unfavorable test results for a skip list map included such a backoff on the atomics or not?

RafSchietekat · ‎05-30-2008

Dmitriy (and others), after you have obtained tbb20_20080408oss_src from the download page (stable, not development), the following may help (copied from a makefile, please adapt to your situation):

cp $(FROM)/gcc_power.h cp $(TO)/include/tbb/machine
cp $(FROM)/gcc_sparc.h $(TO)/include/tbb/machine
cp $(FROM)/gcc_x86_64_32.h $(TO)/include/tbb/machine
cp $(FROM)/ibm_aix51.h $(TO)/include/tbb/machine
cp $(FROM)/linux_common.h $(TO)/include/tbb/machine
cp $(FROM)/linux_em64t.h $(TO)/include/tbb/machine
cp $(FROM)/linux_ia32.h $(TO)/include/tbb/machine
cp $(FROM)/linux_itanium.h $(TO)/include/tbb/machine
cp $(FROM)/mac_ppc.h $(TO)/include/tbb/machine
cp $(FROM)/windows_em64t.h $(TO)/include/tbb/machine
cp $(FROM)/windows_ia32.h $(TO)/include/tbb/machine
cp $(FROM)/windows_ia32_inline.h $(TO)/include/tbb/machine
cp $(FROM)/atomic.h $(TO)/include/tbb
cp $(FROM)/tbb_machine.h $(TO)/include/tbb
cp $(FROM)/atomic_support.c $(TO)/src/tbb/ibm_aix51
cp $(FROM)/queuing_rw_mutex.cpp $(TO)/src/tbb
cp $(FROM)/tbb_misc.h $(TO)/src/tbb
cp $(FROM)/test_atomic.cpp $(TO)/src/test
cp $(FROM)/test_compiler.cpp $(TO)/src/test
cp $(FROM)/test_rwm_upgrade_downgrade.cpp $(TO)/src/test

Actually that's from my current list after I changed gcc_x86.h back to the rather silly gcc_x86_64_32.h still used in the zip file; I hope I didn't forget anything else.

robert-reed · ‎05-30-2008

There are no stupid questions, only stupid answers, like uh,this one:we don't have cvs/svn access. That idea has been kicked around and I just sent a missive internally to kick around another idea, but for now the quickest way for you to see Raf's contribution would be if Raf could send you a copy. (Told you it was going to be a stupid answer.) Meanwhile we'll work on this and I'll see if we can come up with a better answer.

RafSchietekat · ‎05-30-2008

I'm sure I'm misinterpreting Robert's last post, but just in case: you only need Downloads>"Stable Release">tbb20_20080408oss>tbb20_20080408oss_src.tar.gz and "TBB Discussion Forum">"Additions to atomic">{atomic.Raf_Schietekat.20080529.zip from "05-29-2008, 12:08 PM" and patch script from item 32 which is the second post on the second page that does not have an absolute timestamp yet}.

Dmitry_Vyukov · ‎05-30-2008

Raf_Schietekat:
I'm sure I'm misinterpreting Robert's last post, but just in case: you only need Downloads>"Stable Release">tbb20_20080408oss>tbb20_20080408oss_src.tar.gz and "TBB Discussion Forum">"Additions to atomic">{atomic.Raf_Schietekat.20080529.zip from "05-29-2008, 12:08 PM" and patch script from item 32 which is the second post on the second page that does not have an absolute timestamp yet}.

Oh... now I see attachment to the post :)
Ok, thank you. I will try to combine the release and the patch.

Btw, concerning fetch-and-{and,or,xor}.
In my implementation of atomics I provide 2 versions of each function:

value_t fetch_add(value_t v);
void add(value_t v);

value_t fetch_and(value_t v);
void and(value_t v);

and so on...

I'm still not sure whether it makes some sense...

Dmitriy V'jukov

robert-reed · ‎06-02-2008

Sorry for the confusion, Raf. I'd seen that you had posted at least one of your atomic proposals to the Forum as well as submitting them to TBB-contrib, but had forgotten that detail when I composed my latest reply. I'm glad that Dmitriy was able to get a copy. We're working through the internal discussions now to plan how we might regularly share contributions so that our contributors don't have to.

RafSchietekat · ‎06-04-2008

Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (most recent stable release), tested for x86-based GNU/Linux and PowerPC-based Mac OS X, see README for applying the patch.

Better SPARC support (still to be tested), first draft for Alpha support.

RafSchietekat · ‎06-20-2008

Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (not the most recent stable release anymore, but...), tested for x86-based GNU/Linux (not yet for 32-bit PowerPC-based Mac OS X), see README for applying the patch (I forgot to include this file in the official contribution).

Locked atomics (currently only used for 64 bits on 32-bit PowerPC), hopefully more complete support for Itanium, successful test for SPARC (well, at some point during the making of this version anyway), various things.

Dmitry_Vyukov · ‎06-20-2008

Raf_Schietekat:

Locked atomics (currently only used for 64 bits on 32-bit PowerPC)

And what about early x86-64 systems, which doesn't support double-word cmpxchg instruction? ;)

Dmitriy V'jukov

RafSchietekat · ‎06-22-2008

Dmitriy, feel free to contribute what I need to know (I suppose you mean quadword, because doubleword seems to be 32 bits on IA-32/Intel 64). I already merged the 32-bit and 64-bit variants of the x86 architecture (as described in Intel's documentation as IA-32/Intel 64), so it should be relatively easy to fit in an intermediate version. I just need to know the relevant predefined macros and what subset is available (everything except 64-bit cmpxchg?), I suppose (unless it's not a subset, in which case I need some more documentation).

Anyone, feel free to test anything other than g++ on 32-bit x86 and 32-bit PowerPC (those I can easily do myself, so they'll only need final validation later on, except that for the latter I could only build part of TBB). I did a test of SPARC at some point, but for the others I can only hope that I made no mistakes (it's not impossible, but...). In particular, Itanium needs some attention (totally revised), but maybe you want to use an Alpha machine a little while longer (not yet in TBB); I don't know how much interest there would be in PA-RISC (which would require a revision of some TBB assumptions because unlocked semaphores are non-zero), or in MIPS (I could not easily find documentation for it)? I have merged with tbb21_20080605oss by now, if anyone needs that instead.

P.S.: Hey, I reached a century (or how is that in cricket speak?)!

RafSchietekat · ‎07-05-2008

Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release), tested for x86-based GNU/Linux and partially for 32-bit PowerPC-based Mac mini (single core), see README for applying the patch.

Not really interesting, just an upgrade to a new TBB stable release version, and 64-bit atomics didn't actually work yet on 32-bit PowerPC in the previous version (I was away from my Mac mini and made a few too many assumptions). The one interesting thing is that I've found that... it doesn't work for a release build on x86 with g++: test_atomic.exe fails for 64-bit atomics, and I don't know why. It does work for a debug build, and for both builds on Mac OS X, but it fails for release on Ubuntu 6.06 for real or on 8.04 in Sun xVM VirtualBox. Of course I'll keep trying to find out why (an extra problem is that sometimes the problem disappears when I add diagnostics, although it seems to be perfectly reproducible between runs if the code is the same), but if you'd like to, e.g., compile to assembler and investigate that to tell me what I missed (the relevant section should be fairly short), please go right ahead.

Oh yes, the revised Itanium code still has never been tested, and any other useful comments are welcome as well.

Dmitry_Vyukov · ‎07-07-2008

Raf_Schietekat:
Dmitriy, feel free to contribute what I need to know (I suppose you mean quadword, because doubleword seems to be 32 bits on IA-32/Intel 64). I already merged the 32-bit and 64-bit variants of the x86 architecture (as described in Intel's documentation as IA-32/Intel 64), so it should be relatively easy to fit in an intermediate version. I just need to know the relevant predefined macros and what subset is available (everything except 64-bit cmpxchg?), I suppose (unless it's not a subset, in which case I need some more documentation).

Sorry for acting like a$$ :)

I mean that some early x86-64 chips lacks cmpxchg16b instruction (128-bit, double-word CAS, in terms of 64-bit platform). And you can't detect cmpxchg16b support statically, you can detect it only in run-time, by analysing values returned by CPUID (eax=1) instruction. So you can't just define some macros...

Dmitriy V'jukov