The discussion in http://software.intel.com/en-us/forums//topic/58670 raises the point that TBB's atomic operations are lacking:
(2) can be expressed pathetically as atomic
I'm wondering about opinions on syntax and generality for the additions suggested above. The simplest approach would be to add:
Another approach would be to add T atomic
Would the "simplest approach" be enough?
You are right that a full fence is not the same as "acquire;release" or "release;acquire". The latter can be reordered to be the former, and the former allows "x; acquire; release; y" to be reordered as "acquire; x; y; release". I should have been more specific and said that "full fence" is the same as atomic execution of acquire+release.
In my opinion, LFENCE and SFENCE are regrettable beasts from the precambrian explosion of memory consistency models, before modern acquire/release models evolved.
From the viewpoint of the C++ 200x draft, Intel IA-32 and Intel 64 processors effectively have seq_cst semantics for all LOCK-prefixed operations (per white paper), acquire semantics for ordinary loads, and release semantics for ordinary stores. Itanium has seq_cst for its fetch-and-add, exchange, and compare-and-swap.
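To make the acquire/release pairing concrete, here is a minimal sketch using the `std::atomic` names that were later standardized in C++11 (this thread predates them). The names `payload`, `ready`, `producer`, and `consumer` are made up for illustration and are not part of TBB.

```cpp
#include <atomic>
#include <thread>

// Hypothetical example: hand a plain int from one thread to another via a flag.
int payload = 0;                      // ordinary, non-atomic data
std::atomic<bool> ready{false};       // publication flag

// Intended to run on the producing thread.
void producer() {
    payload = 42;                                    // ordinary store
    ready.store(true, std::memory_order_release);    // release: the payload write cannot sink below this store
}

// Intended to run on the consuming thread.
int consumer() {
    // acquire: the payload read below cannot hoist above this load
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;                                  // sees the value written before the release store
}
```

On IA-32 and Intel 64, compilers typically implement this release store and acquire load as plain MOV instructions, which matches the observation above that ordinary stores and loads already carry those semantics there.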
For increment, you can break free with atomic
If we were to add support for fenceless variants, we could add a
I'm curious to see your proposal.
Raf_Schietekat: About Arch's message, paragraph by paragraph:
A full fence defined this way leaves out sequential consistency, strictly speaking (memory_model_acq_rel vs. memory_model_seq_cst), which may or may not become visible on some processors (which ones?), though not on IA-32/Intel 64. Which processors might that be, and what are the performance considerations for making this (farfetched?) difference? Dmitriy?
Why regrettable, if they are hidden in locks and atomics? It seems a bigger problem that now IA-32/Intel 64 does not allow a way out to provide cheap raw/unfenced/relaxed atomic operations, especially involving locked operations, which now impose a full fence (also see below).
The question is whether to allow all combinations orthogonally, or to impose any perceived wisdom on the user through undefined template specialisations. The challenge is to do this across all architectures (do they have incompatible sweet spots or can they agree?).
[...] fences may exist in the following combinations: [...] unfenced/reload/flush/reload+flush as another sequential consistency-related family.
With a bit of elementary-particle physics, fences may exist in the following combinations: unfenced/acquire/release/release+acquire as one causality-related family, and unfenced/reload/flush/reload+flush as another sequential consistency-related family. In addition to what their counterparts in the first family do, a reload will not move up across a flush, which gets rid of relativity (mixing my metaphors), but I don't know yet if that is all that is different. Then there's the matter of what it means to have a device that unidirectionally prevents reordering of anything (does it make sense at all?), the meaning of LoadStore, rationale for stand-alone fences, a lot of other things and whether they are relevant or just there to confuse me... So I'm still wondering about such matters, and Dmitriy just sabotaged my attempt to limit my scope.
MADadrobiso: With respect to the reference-counting example, my impression is that it is typically unsafe to leave out the fence, because the increment of the reference count typically implies that a thread is acquiring rights to read a shared reference-counted object.
Provided so-called 'basic thread-safety' (a thread is permitted to acquire a reference to an object *only* if it already has one; a very common case, for example boost::shared_ptr), it's OK to remove the fence, because the thread isn't actually acquiring anything.
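A minimal sketch of that argument, written with C++11 `std::atomic` names that postdate this thread. `RefCounted`, `add_ref`, `release`, and `use_count` are hypothetical names; this is not boost::shared_ptr's actual code. The increment is relaxed because the caller already holds a reference, while the decrement still needs ordering so that the thread destroying the object sees every other owner's writes.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical intrusive reference count under "basic thread-safety".
class RefCounted {
    std::atomic<long> refs_{1};   // the creator holds the first reference
public:
    // The caller must already own a reference, so nothing new is acquired here
    // and no fence is needed.
    void add_ref() { refs_.fetch_add(1, std::memory_order_relaxed); }

    // Returns true if the caller dropped the last reference and must destroy the object.
    bool release() {
        long previous = refs_.fetch_sub(1, std::memory_order_release);
        assert(previous > 0);     // catches an unbalanced release()
        if (previous == 1) {
            // Pair with the release decrements of the other owners before destruction.
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }

    long use_count() const { return refs_.load(std::memory_order_relaxed); }
};
```

If a thread could obtain its first reference through the count itself (strong thread safety), the relaxed increment would no longer be justified, which is the case the preceding message is worried about.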
That p. 11 example is not to the point for illustrating acq_rel, I would say.
As for allowing all combinations orthogonally vs. imposing the designer's wisdom, I might be convinced by the argument that sometimes a language gives you more by the things it does not allow you to do (but I can't say now where I've heard that before).
In the second family I mentioned, reload would mean "forget any cached reads" (not actually reload, but the best word I could find was what it would imply later on), and flush would mean "write all dirty entries to memory now and wait until finished"; I don't know how useful flush and reload might be by themselves, though.
Oh boy, Alexander Terekhov's list again... I tried to ignore that for lack of a legend (what do the entries all mean?), so I obviously don't know whether it's useful at all. So now it really is all back on the table. One problem is that too much detail might be too confusing and lead to bugs.
What is the context for your atomics proposal?
I dislike the C++0x proposal because it looks too much like plain C, and I remember how an active desire (driven by IBM, I think) to be compatible with plain C ruined another standard, the C++ CORBA mapping (which otherwise might have had things like a sequence template class, and not have suffered debilitating leaks by using something like auto_ptr sources and sinks instead of plain pointers). Here it's not as problematic, but it just looks ugly, and elegance is a real goal, and I would hate it if the current C++0x atomics proposal goes to press like this (it's my civic duty to oppose it).
Maybe you don't need to use the memory-order argument in a switch or anything, but it will still be part of the call (right?), whereas a template argument is completely invisible. Is there any reason not to use a templated function like TBB does, where template specialisation could do the same as having different overloads? There's a whole lot of things that can be done with template metaprogramming... I was thinking of having architecture-specific traits for the operations, and using those to select appropriate "packaging" implementations for the different memory semantics.
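For illustration only, here is one way the template-argument idea could look, layered on C++11 `std::atomic` (which postdates this thread). Everything in namespace `sketch` is a hypothetical name and not TBB's implementation. The memory semantics are chosen at compile time through a trait, and a combination that makes no sense (a load with release semantics) is rejected simply by leaving its specialisation undefined.

```cpp
#include <atomic>

namespace sketch {   // hypothetical names throughout

enum memory_semantics { full_fence, acquire, release, relaxed };

// Trait for read-modify-write operations: every tag is allowed.
template<memory_semantics M> struct rmw_order;
template<> struct rmw_order<full_fence> { static constexpr std::memory_order value = std::memory_order_seq_cst; };
template<> struct rmw_order<acquire>    { static constexpr std::memory_order value = std::memory_order_acquire; };
template<> struct rmw_order<release>    { static constexpr std::memory_order value = std::memory_order_release; };
template<> struct rmw_order<relaxed>    { static constexpr std::memory_order value = std::memory_order_relaxed; };

// Trait for loads: deliberately no specialisation for release,
// so load<release>() fails to compile.
template<memory_semantics M> struct load_order;
template<> struct load_order<full_fence> { static constexpr std::memory_order value = std::memory_order_seq_cst; };
template<> struct load_order<acquire>    { static constexpr std::memory_order value = std::memory_order_acquire; };
template<> struct load_order<relaxed>    { static constexpr std::memory_order value = std::memory_order_relaxed; };

template<typename T>
class atomic_sketch {
    std::atomic<T> v_;
public:
    explicit atomic_sketch(T init = T()) : v_(init) {}

    // The ordering is a template argument, so it never appears as a runtime parameter.
    template<memory_semantics M = full_fence>
    T fetch_and_add(T addend) { return v_.fetch_add(addend, rmw_order<M>::value); }

    template<memory_semantics M = full_fence>
    T load() const { return v_.load(load_order<M>::value); }
};

} // namespace sketch
```

A call then reads `counter.fetch_and_add<sketch::relaxed>(1)`. An architecture-specific version would replace the `std::atomic` calls with per-platform implementations selected by the same traits.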
My inclination for TBB is:
So we would end up with four options: sequentially consistent, acquire, release, and relaxed.
TBB 2.1 is pretty much frozen, so these changes will have to be made after that. I'm swamped on last-minute bug fixes for 2.1, so I have not been able to give this forum much time :-(
Another example of a non-TSO processor is Itanium. Ordinary Itanium stores are not TSO. Itanium st.rel stores are TSO.