Additions to atomic<T>

ARCH_R_Intel · ‎05-09-2008

The discussion in http://software.intel.com/en-us/forums//topic/58670 raises the point that TBB's atomic operations are lacking:

1. A way to express an unfenced atomic load
2. A way to express a full fence (i.e. fence with both acquire and release semantics)

(2) can be expressed pathetically as atomic::fetch_and_add(0).

I'm wondering about opinions on syntax and generality for the additions suggested above. The simplest approach would be to add:

T atomic::unfenced_load() const
void atomic::fence() const

Another approach would be to add a member template T atomic::load<M>() const, and allow M to specify acquire, release, full fence, or unfenced. But my opinion is that this drags in much pointless complexity, because it adds "full_fence" and "unfenced" to the possibilities for M, which is a template parameter for other atomic ops too, and the "release" variant for a load seems of dubious value.

Would the "simplest approach" be enough?
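For readers comparing this with the later standard library, the two proposed members can be sketched on top of C++11 std::atomic. This is only an illustration, not TBB code: the class name atomic_sketch is hypothetical, unfenced_load is assumed to map to a relaxed load, and fence is assumed to map to a sequentially consistent stand-alone fence, which is one plausible reading of "full fence".

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper, not part of TBB: illustrates the two proposed members.
template <typename T>
class atomic_sketch {
public:
    explicit atomic_sketch(T v = T()) : value_(v) {}

    // Unfenced load: atomic, but imposes no ordering on surrounding accesses.
    T unfenced_load() const { return value_.load(std::memory_order_relaxed); }

    // Full fence, expressed here as a stand-alone sequentially consistent fence.
    void fence() const { std::atomic_thread_fence(std::memory_order_seq_cst); }

    // The workaround mentioned earlier: a read-modify-write that adds zero
    // acts as a fence, at the cost of a needless write to the variable.
    T fetch_and_add(T delta) { return value_.fetch_add(delta, std::memory_order_seq_cst); }

private:
    std::atomic<T> value_;
};
```

A caller would write a.unfenced_load() where ordering does not matter and a.fence() where a full barrier is needed, instead of a.fetch_and_add(0).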

- Arch

RafSchietekat · ‎05-09-2008

I've actually been working on just such a proposal (which compiles but needs some more work after stalling on bug 124), because the C++0x people are getting ready to converge on some far less elegant syntax than what TBB currently has (kudos for that!), and TBB is lacking in the memory-semantics department. I would stick to the templated-operations approach (no separate names, or function arguments like the current C++0x proposal, which will hopefully be optimised away?), with default memory semantics for each kind of operation, and perhaps, e.g., load specialised as only a declaration but without a definition. Let the compiler worry about the complexities, I say, as long as the programmer experience is pleasant enough. Just be careful about equating a full fence with one "with both acquire and release semantics", because it is not the same: even for IA-32 and Intel 64 (I have never quite figured out how these two relate), SFENCE+LFENCE (or LFENCE+SFENCE) is not an MFENCE (but I guess TBB targets more processors than have MFENCE as a cheaper alternative than CPUID for full serialisation?), and the current proposal for C++0x actually distinguishes acq_rel (I don't know why in that order) and seq_cst (full serialisation).

A probably major difference of opinion is that C++0x chooses a full fence as the default for load and store, because of one infamous example, whereas Intel seems totally committed to ordered loads on the one hand and ordered stores on the other, with no way to break free even if the programmer wants to (and a locked increment actually implying a full fence!), and I still have not convinced myself that I understand all the implications. It's actually a schizophrenic situation within the C++0x scene, because they now have this full-fence default, even though there are probably some who consider raw/unfenced/relaxed to be the main usage model for atomics.

Should I contribute my current version now? It's still a work in progress, though, even if it currently builds just fine. My opinion is that the main objective should be to improve what C++0x will impose on us all, and time is short.

ARCH_R_Intel · ‎05-09-2008

You are right that a full fence is not the same as "acquire;release" or "release;acquire". The latter can be reordered to be the former, and the former allows "x; acquire; release; y" to be reordered as "acquire; x; y; release". I should have been more specific and said that "full fence" is the same as atomic execution of acquire+release.

In my opinion, LFENCE and SFENCE are regrettable beasts from the precambrian explosion of memory consistency models, before modern acquire/release models evolved.

From the viewpoint of the C++ 200x draft, Intel IA-32 and Intel 64 processors effectively have seq_cst semantics for all LOCK-prefixed operations (per white paper), acquire semantics for ordinary loads, and release semantics for ordinary stores. Itanium has seq_cst for its fetch-and-add, exchange, and compare-and-swap.

For increment, you can break free with atomic::fetch_and_increment<acquire> or the <release> variant. We don't have a fenceless variant in TBB because it seems to have limited use. Except for unfenced/raw loads, I don't see major utility in unfenced atomic operations. Counters for statistics seem to be the repeated example.

If we were to add support for fenceless variants, we could add a variant, template members for load and store. There's something to be said for adding named "load" and "store" member templates to allow fencing variants. My concern is that a lot of these variants are silly. E.g., what use is a store-with-acquire or load-with-release?

I'm curious to see your proposal.

Dmitry_Vyukov · ‎05-11-2008

If I understand you correctly (maybe it's just my English), you are misinterpreting std::memory_model_acq_rel. It's a full fence. It's not acquire+release in the sense of SFENCE+LFENCE on x86. Yeah, the name 'acq_rel' is quite confusing.
And std::memory_model_seq_cst is not just a full fence. It's full fence + global operation order. So the default memory order in C++0x is not full fence, it's full fence + global operation order.

About relaxed atomics. There is an extremely important case (besides statistics :) ) - reference counting with basic thread safety:
void rc_acquire(rc_t* obj)
{
    obj->counter.increment(memory_model_relaxed);
}
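The same idea can be sketched with C++11 std::atomic. This is a hedged illustration rather than anyone's actual library code: rc_sketch and rc_release are hypothetical names, and the release-decrement plus acquire-fence pairing on the last reference is the conventional companion to a relaxed increment, not something stated in this thread.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical reference-counted header; names are illustrative only.
struct rc_sketch {
    std::atomic<int> counter;
    rc_sketch() : counter(1) {}   // the creator holds the first reference
};

// The caller already owns a reference, so the increment needs no ordering.
inline void rc_acquire(rc_sketch* obj) {
    obj->counter.fetch_add(1, std::memory_order_relaxed);
}

// Returns true when the caller dropped the last reference and may destroy obj.
inline bool rc_release(rc_sketch* obj) {
    if (obj->counter.fetch_sub(1, std::memory_order_release) == 1) {
        // Make earlier writes by other owners visible before destruction.
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}
```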

Also, Paul McKenney, in the rationale for the memory model for C++0x, provides the following example:

for (size_t i = 0; i != mailbox_count; ++i)
{
    if (msg_t* m = atomic_xchg(&mailbox[i].head, 0, memory_model_relaxed))
    {
        // execute the stand-alone fence only if we got a message, not for every check
        fence_acquire();
        process(m);
    }
}

This is also the rationale for stand-alone fences.
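A self-contained version of the mailbox pattern, written against C++11 std::atomic, might look like the sketch below. The types msg_sketch and mailbox_sketch and the functions post and poll_all are hypothetical; summing the payloads merely stands in for process(m).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct msg_sketch { int payload; };

struct mailbox_sketch {
    std::atomic<msg_sketch*> head;
    mailbox_sketch() : head(nullptr) {}
};

// Producer side: publish the message with release semantics.
inline void post(mailbox_sketch& box, msg_sketch* m) {
    box.head.store(m, std::memory_order_release);
}

// Consumer side: poll every mailbox with a relaxed exchange and pay for the
// acquire fence only when a message was actually found.
inline int poll_all(mailbox_sketch* boxes, std::size_t count) {
    int sum = 0;
    for (std::size_t i = 0; i != count; ++i) {
        if (msg_sketch* m = boxes[i].head.exchange(nullptr, std::memory_order_relaxed)) {
            std::atomic_thread_fence(std::memory_order_acquire);
            sum += m->payload;   // stands in for process(m)
        }
    }
    return sum;
}
```

The acquire fence after the relaxed exchange pairs with the producer's release store, so the message contents are visible before they are read.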

About the default memory order in C++0x.
Well, yes, seq_cst is not intended for use in high-performance synchronization algorithms. Anyway, there is a rationale for making seq_cst the default. As I understand it, first you prototype the algorithm with seq_cst. When you see that it works with seq_cst, you optionally and selectively start replacing seq_cst with weaker types of fences at critical points (on the fast path). On the slow path you can leave seq_cst. And undoubtedly seq_cst is much easier to reason about.

You can see Paul McKenney's proposal for memory model for C++0x:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2237.pdf
He concludes:

"This is a prioritized list of differences of this model versus ISOMM:
1. Provide standalone memory fences. This is crucial to avoid the introduction of unnecessary
ordering constraints. Our proposal has been to introduce only three forms of ordering constraints,
but an alternative is to include all possible fences (LoadLoad, LoadStore, StoreLoad & StoreStore)
to allow exploitation on hardware architectures that provide such primitives.
2. Do not require ordering on atomic operations over what is specified by their ordering constraints.
In particular, allow full reordering of raw atomic operations, and allow load-acquire operations to
be reordered ahead of preceding store-release operations.
3. Allow atomic operations to be removed if they are found to be redundant based on sequential
program analysis.
4. Define a mechanism to enable ordering of dependent memory operations, both through control
flow or data flow dependencies.
5. Define a mechanism to allow ordering of atomic operations without introducing any hardware
primitives."

Dmitriy V'jukov

RafSchietekat · ‎05-12-2008

About Arch's message, paragraph by paragraph:

A full fence defined this way leaves out sequential consistency, strictly speaking (memory_model_acq_rel vs. memory_model_seq_cst), which may or may not become visible on some processors (which ones?), though not on IA-32/Intel 64. Which processors might that be, and what are the performance considerations for making this (farfetched?) difference? Dmitriy?

Why regrettable, if they are hidden in locks and atomics? It seems a bigger problem that now IA-32/Intel 64 does not allow a way out to provide cheap raw/unfenced/relaxed atomic operations, especially involving locked operations, which now impose a full fence (also see below).

IA-32/Intel 64 also has seq_cst for XCHG (implicit lock signal), other than the more explicit serialisation instructions.

As long as any processors out there allow unfenced anything, it would seem presumptuous not to allow them to take advantage of it. I think that the statistics example may have been overused (it does seem like grasping at straws), because the bread-and-butter rationale would be reference-counted pointers (see another recent thread that commented on their slowness, which may or may not be caused by IA-32/Intel 64 imposing full-fence semantics?). Dmitriy also mentioned this, I see, and it seems a very strong reason.

The question is whether to allow all combinations orthogonally, or to impose any perceived wisdom on the user through undefined template specialisations. The challenge is to do this across all architectures (do they have incompatible sweet spots or can they agree?).

I've been tinkering with my code some more, but it seems that I don't have all the facts together yet (and the good weather is beckoning me to get outside), so most is still to be done. The end result should be simplicity, of course: the user gets some sensible default semantics, or chooses his own, and the code tries to get that result the cheapest way it can on the particular architecture.

That covers Arch's message. I wrote most of this before Dmitriy posted his message, so...

With a bit of elementary-particle physics, fences may exist in the following combinations: unfenced/acquire/release/release+acquire as one causality-related family, and unfenced/reload/flush/reload+flush as another sequential consistency-related family. In addition to what their counterparts in the first family do, a reload will not move up across a flush, which gets rid of relativity (mixing my metaphors), but I don't know yet if that is all that is different. Then there's the matter of what it means to have a device that unidirectionally prevents reordering of anything (does it make sense at all?), the meaning of LoadStore, rationale for stand-alone fences, a lot of other things and whether they are relevant or just there to confuse me... So I'm still wondering about such matters, and Dmitriy just sabotaged my attempt to limit my scope.

(Corrected) Removed some things in last paragraph.

Dmitry_Vyukov · ‎05-12-2008

Raf_Schietekat:
About Arch's message, paragraph by paragraph:
A full fence defined this way leaves out sequential consistency, strictly speaking (memory_model_acq_rel vs. memory_model_seq_cst), which may or may not become visible on some processors (which ones?), though not on IA-32/Intel 64. Which processors might that be, and what are the performance considerations for making this (farfetched?) difference? Dmitriy?

I'm not ready to answer about processors, but there are definitely some processors without total store order (I think it's SPARC when not in TSO mode, maybe PPC, maybe ARM).
A full fence without sequential consistency can lead to very counter-intuitive results. I think that seq_cst was added mainly to provide clear and easy-to-reason-about semantics, and to provide a tool for prototyping.

See page 11 in presentation "The Future of Concurrency in C++":
http://www.justsoftwaresolutions.co.uk/files/future_of_concurrency.pdf
You can replace release and acquire fences with acq_rel (full) fences - it will not change the output.
As you can see output is very counter-intuitive.

See page 12 where seq_cst is used. Result is intuitive.

Raf_Schietekat:

Why regrettable, if they are hidden in locks and atomics? It seems a bigger problem that now IA-32/Intel 64 does not allow a way out to provide cheap raw/unfenced/relaxed atomic operations, especially involving locked operations, which now impose a full fence (also see below).

Agree. But I don't think that we can change anything here :)

Raf_Schietekat:

The question is whether to allow all combinations orthogonally, or to impose any perceived wisdom on the user through undefined template specialisations. The challenge is to do this across all architectures (do they have incompatible sweet spots or can they agree?).

AFAIK, the current C++0x standard prohibits only store-acquire, store-acq-rel, load-release and load-acq-rel.
I don't know the rationale behind this.

Raf_Schietekat:

[...] fences may exist in the following combinations: [...] unfenced/reload/flush/reload+flush as another sequential consistency-related family.

Please describe semantics of unfenced/reload/flush/reload+flush in more detail.

Dmitriy V'jukov

Dmitry_Vyukov · ‎05-12-2008

Raf_Schietekat:

With a bit of elementary-particle physics, fences may exist in the following combinations: unfenced/acquire/release/release+acquire as one causality-related family, and unfenced/reload/flush/reload+flush as another sequential consistency-related family. In addition to what their counterparts in the first family do, a reload will not move up across a flush, which gets rid of relativity (mixing my metaphors), but I don't know yet if that is all that is different. Then there's the matter of what it means to have a device that unidirectionally prevents reordering of anything (does it make sense at all?), the meaning of LoadStore, rationale for stand-alone fences, a lot of other things and whether they are relevant or just there to confuse me... So I'm still wondering about such matters, and Dmitriy just sabotaged my attempt to limit my scope.

For now I stick to the following approach.
First of all, whenever possible I use C++0x interfaces, semantics and names. For example, relaxed rather than unfenced/naked. It seems that for now C++0x is sufficiently stable wrt memory model and atomics. Current C++0x draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2588.pdf

In my personal opinion, C++0x provides too restrictive semantics. Luckily it's not only my opinion, so I also take as a basis the proposals for the C++0x memory model/atomics from Peter Dimov, Alex Terekhov, Paul McKenney:
http://groups.google.com/group/comp.programming.threads/msg/b43cd6c9411c95b9
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2237.pdf
I hope that they thought enough before making their proposals, so that I don't have to :)

Initially I was going to implement atomics with templates:
atomic_store<memory_model_something>(&x, 1);
I was thinking that it's the only efficient method, and that the following approach would inevitably add unnecessary overhead:
atomic_store(&x, 1, memory_model_something);

Finally I figured out how to implement atomics with the C++0x interface (atomic_store(&x, 1, memory_model_something)) in an efficient and comfortable way. Here is a code sketch:

/*** Pay attention to inheritance ***/

// fake root
struct memory_order {};

/***** full fences *****/

// full fence + total order
struct mo_seq_cst_t : memory_order {};
// full fence
struct mo_acq_rel_t : mo_seq_cst_t {};

/***** release fences *****/

// classic release
struct mo_release_t : mo_acq_rel_t {};
// release not affecting stores
struct mo_release_load_t : mo_release_t {};
// release not affecting loads
struct mo_release_store_t : mo_release_t {};

/***** relaxed fences *****/

// does not order memory
struct mo_relaxed_t : mo_acq_rel_t {};

extern mo_seq_cst_t memory_order_seq_cst;
extern mo_acq_rel_t memory_order_acq_rel;

extern mo_release_t memory_order_release;
extern mo_release_load_t memory_order_release_load;
extern mo_release_store_t memory_order_release_store;

/*** Implementation for MSVC/x86 ***/

// Only store - other operations stripped

class atomic32
{
public:
    typedef unsigned value_type;

    explicit atomic32(value_type v = value_type())
        : value_(v)
    {
    }

    void store(value_type v, memory_order = memory_order_seq_cst) volatile
    {
        _InterlockedExchange((long*)&value_, v);
    }

    void store(value_type v, mo_release_t) volatile
    {
        _ReadWriteBarrier();
        value_ = v;
    }

    void store(value_type v, mo_relaxed_t) volatile
    {
        value_ = v;
    }

private:
    value_type volatile value_;

    atomic32(atomic32 const&);
    atomic32& operator = (atomic32 const&);

    /*** forbidden ***/
    void store(value_type v, mo_acq_rel_t) volatile;
};

Basic rule: I always provide an implementation for seq_cst. Then I provide specialized implementations where they can be implemented more efficiently on the current architecture (release). Then I move forbidden combinations to the private section (acq_rel).
All other things are handled by inheritance. I.e. store with mo_acquire_t is also forbidden. Store with mo_release_store_t uses the implementation for store with mo_release_t.
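The overload-resolution trick described here can be tried out portably. The following is a sketch under stated assumptions, not the original MSVC code: it is layered on C++11 std::atomic, all tag and class names are hypothetical, and the path() member exists only so the demo can observe which overload was selected.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical tag hierarchy mirroring the inheritance scheme described above.
struct order_tag {};
struct seq_cst_tag : order_tag {};
struct acq_rel_tag : seq_cst_tag {};
struct release_tag : acq_rel_tag {};
struct release_store_tag : release_tag {};
struct relaxed_tag : acq_rel_tag {};

class atomic32_sketch {
public:
    explicit atomic32_sketch(unsigned v = 0) : value_(v), path_('-') {}

    // Fallback for every tag without a closer overload: sequentially consistent.
    void store(unsigned v, order_tag = seq_cst_tag()) {
        value_.store(v, std::memory_order_seq_cst);
        path_ = 's';
    }
    // Chosen for release_tag and for tags derived from it.
    void store(unsigned v, release_tag) {
        value_.store(v, std::memory_order_release);
        path_ = 'r';
    }
    void store(unsigned v, relaxed_tag) {
        value_.store(v, std::memory_order_relaxed);
        path_ = 'x';
    }

    unsigned load() const { return value_.load(std::memory_order_seq_cst); }
    char path() const { return path_; }   // which overload ran last (demo only)

private:
    // Forbidden: store(v, acq_rel_tag()) selects this overload and then fails
    // the access check, so the call does not compile.
    void store(unsigned v, acq_rel_tag);

    std::atomic<unsigned> value_;
    char path_;
};
```

Passing release_store_tag() picks the release_tag overload because conversion to the nearest base class ranks best in overload resolution.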

Also I add a bunch of compiler fences (affects only code generation):

/***** compiler fences *****/

struct co_acq_rel_t : mo_acq_rel_t {};
struct co_acquire_t : co_acq_rel_t {};
struct co_release_t : co_acq_rel_t {};

I'm not sure about bidirectional fences (store-store (sfence), load-load (lfence) ). For now I just comment them out.

What do you think?

Dmitriy V'jukov

RafSchietekat · ‎05-12-2008

That p. 11 example is not to the point for illustrating acq_rel, I would say. As for allowing all combinations orthogonally vs. imposing the designer's wisdom I might be convinced by the argument that sometimes a language gives you more by the things it does not allow you to do (but I can't say now where I've heard that before). In the second family I mentioned, reload would mean "forget any cached reads" (not actually reload, but the best word I could find was what it would imply later on), and flush would mean "write all dirty entries to memory now and wait until finished"; I don't know how useful flush and reload might be by themselves, though.

Oh boy, Alexander Terekhov's list again... I tried to ignore that for lack of a legend (what do the entries all mean?), so I obviously don't know whether it's useful at all. So now it really is all back on the table. One problem is that too much detail might be too confusing and lead to bugs.

What is the context for your atomics proposal?

I dislike the C++0x proposal because it looks too much like plain C, and I remember how an active desire (driven by IBM, I think) to be compatible with plain C ruined another standard, the C++ CORBA mapping (which otherwise might have had things like a sequence template class, and not have suffered debilitating leaks by using something like auto_ptr sources and sinks instead of plain pointers). Here it's not as problematic, but it just looks ugly, and elegance is a real goal, and I would hate it if the current C++0x atomics proposal goes to press like this (it's my civic duty to oppose it).

Maybe you don't need to use the memory-order argument in a switch or anything, but it will still be part of the call (right?), whereas a template argument is completely invisible. Is there any reason not to use a templated function like TBB does, where template specialisation could do the same as having different overloads? There's a whole lot of things that can be done with template metaprogramming... I was thinking of having architecture-specific traits for the operations, and using those to select appropriate "packaging" implementations for the different memory semantics.

ARCH_R_Intel · ‎05-12-2008

With respect to the reference-counting example, my impression is that it is typically unsafe to leave out the fence, because the increment of the reference count typically implies that a thread is acquiring rights to read a shared reference-counted object.

Dmitry_Vyukov · ‎05-12-2008

MADadrobiso:
With respect to the reference-counting example, my impression is that it is typically unsafe to leave out the fence, because the increment of the reference count typically implies that a thread is acquiring rights to read a shared reference-counted object.

Provided so-called 'basic thread-safety' (a thread is permitted to acquire a reference to an object *only* if it already has one; a very common case, for example boost::shared_ptr), it's ok to remove the fence, because the thread isn't actually acquiring anything.
Provided so-called 'strong thread-safety' (a thread is permitted to acquire a reference to an object even if it doesn't have one; a very uncommon case), it is unsafe to remove the fence, because the thread is actually acquiring access to the object.

Dmitriy V'jukov

RafSchietekat · ‎05-12-2008

A shared pointer cannot protect its referent, because its control ends when operator* or operator-> returns, so a shared referent needs to protect itself. The only valid consideration is how often any fence implicit to reference count manipulation is wasted, which is probably often enough to be a concern: just reading from the referent, destroying the pointer, ...

Dmitry_Vyukov · ‎05-12-2008

randomizer:

Provided so-called 'basic thread-safety' (a thread is permitted to acquire a reference to an object *only* if it already has one; a very common case, for example boost::shared_ptr), it's ok to remove the fence, because the thread isn't actually acquiring anything.

In this case the mechanism that transfers the object between threads (e.g. a producer-consumer queue) must execute an acquire fence, because this mechanism gives the thread access to the object.

Dmitriy V'jukov

Dmitry_Vyukov · ‎05-14-2008

Raf_Schietekat:

That p. 11 example is not to the point for illustrating acq_rel, I would say.

It illustrates difference between acq_rel and seq_cst.

Raf_Schietekat:

As for allowing all combinations orthogonally vs. imposing the designer's wisdom I might be convinced by the argument that sometimes a language gives you more by the things it does not allow you to do (but I can't say now where I've heard that before).

I think you are right. This will eliminate at least a bit of complexity.
But I am thinking about special cases. For example sequence lock requires load with memory_model_release_wrt_loads (or msync::slb in Terekhov's list).

Raf_Schietekat:

In the second family I mentioned, reload would mean "forget any cached reads" (not actually reload, but the best word I could find was what it would imply later on), and flush would mean "write all dirty entries to memory now and wait until finished"; I don't know how useful flush and reload might be by themselves, though.

It looks quite unusual and confusing...

Raf_Schietekat:

Oh boy, Alexander Terekhov's list again... I tried to ignore that for lack of a legend (what do the entries all mean?)

There is a legend. Imho the semantics are clear...

Raf_Schietekat:

, so I obviously don't know whether it's useful at all. So now it really is all back on the table. One problem is that too much detail might be too confusing and lead to bugs.

I think, that it's better to dive into as much detail as possible, and then try to climb as high as possible. It's the only way to choose the right altitude.

Raf_Schietekat:

What is the context for your atomics proposal?

None. It's just my library. Similar to TBB.

Raf_Schietekat:

I dislike the C++0x proposal because it looks too much like plain C, and I remember how an active desire (driven by IBM, I think) to be compatible with plain C ruined another standard, the C++ CORBA mapping (which otherwise might have had things like a sequence template class, and not have suffered debilitating leaks by using something like auto_ptr sources and sinks instead of plain pointers). Here it's not as problematic, but it just looks ugly, and elegance is a real goal, and I would hate it if the current C++0x atomics proposal goes to press like this (it's my civic duty to oppose it).

Maybe you don't need to use the memory-order argument in a switch or anything, but it will still be part of the call (right?), whereas a template argument is completely invisible. Is there any reason not to use a templated function like TBB does, where template specialisation could do the same as having different overloads? There's a whole lot of things that can be done with template metaprogramming... I was thinking of having architecture-specific traits for the operations, and using those to select appropriate "packaging" implementations for the different memory semantics.

First of all, I am not a C fan either. I like templates, template metaprogramming etc.
The main reason to make the memory model a parameter is compliance with C++0x. It's The Principle Of Least Surprise. It doesn't matter whether it's better or worse, it's what the user expects and knows.
At present I don't see anything considerable that templates can give here. Handling of "inheritance" between fence types will be more complicated with templates.
Please elaborate about "things that can be done with template metaprogramming" wrt atomics/fence types. Can you provide some examples?

Dmitriy V'jukov

RafSchietekat · ‎05-14-2008

The p. 11 example does not illustrate acq_rel, because either the acq or the rel part has nothing to do with the operations, and acq_rel would even be forbidden in C++0x. With RMW operations on two atomics where sequential consistency or not makes a difference, I would be interested, now I'm not.

I don't see what's so confusing about something that might be a bit unusual (yes, I made it up myself).

I don't understand Alexander Terekhov's list. There might be a legend, but is it self-contained, with examples etc.? Or is it just me? I seriously doubt that! I think I've worked hard enough trying to understand this stuff, more than most people, and this makes no sense to me, all by itself.

Yes, get the details, but then there's a standard to be made that's not a recipe for bugs, and that starts with some level of understandability.

Well, at this point I'm not sure I want to go ahead with proposing an atomics library alternative. For starters, I'm unsure about the real cost of these arguments (can you give some insight in both the C++0x proposal, where I can imagine that thorough optimisation with constant propagation and dead-code elimination will do the trick, and in yours, where I just don't know), which is the only decisive argument for changing things this late in the game. And if you accept that the memory semantics are passed as an argument, then there's a subset (the atomic template) that actually makes sense, and is a superset of TBB atomic, which is currently lacking in functionality (bitwise operators and memory semantics), and which will soon be a squashed mosquito on the windshield of the new C++ standard. It's just that there's a whole load of redundant rubbish API thrown on top, and, given the choice, some programmers will undoubtedly use it. As for the use of templates, all complexity should be hidden as much as possible from the normal user and even from the porting user, of course, but it will hardly be more complicated than what TBB has now. But I have not yet started doing anything remotely sophisticated on this, and I think I might just play around with this a bit without committing or even seriously considering to produce a result. What's the use, anyway: even if I can show something really elegant (which is still a big question), how could it compete? Even if I'm not setting myself up for failure, I would be setting myself up for disappointment, wouldn't I?

ARCH_R_Intel · ‎05-14-2008

My inclination for TBB is:

Make the default for atomic read-modify-write operations the equivalent of C++ 200x's seq_cst. For Intel architectures, this is just a matter of changing the documentation, not the implementation.
Make atomic::operator= retain its "release" semantics and atomic::operator T retain its "acquire" semantics. As with Raf's class, we can add new member functions "load" and "store" to deal with the other cases. If I was doing TBB from scratch, I might make the defaults sequentially consistent. But changing atomic::operator= to sequential consistency now risks a huge performance impact on existing code.
Not add the C++ 200x acq_rel semantics. They seem to offer little gain on most processors, and many hazards. My impression is that acq_rel would not even exist in C++ 200x except for the sake of the Power architecture.
Add the C++ 200x relaxed option. This discussion has convinced me there is sufficient justification, even if there is not currently an efficient way to implement it on Intel architectures.

So we would end up with four options: sequentially consistent, acquire, release, and relaxed.
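As a rough illustration of these four options exposed through "load" and "store" member templates, here is a sketch layered on C++11 std::atomic. Everything in it is an assumption made for illustration: the namespace sketch, the enumerator names, the to_std_order mapping, and the class atomic_with_semantics are hypothetical and are not the TBB interface.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

// Hypothetical names for the four options discussed above.
enum memory_semantics { full_fence, acquire, release, relaxed };

// Compile-time mapping onto the standard memory orders.
template <memory_semantics M> struct to_std_order;
template <> struct to_std_order<full_fence> { static constexpr std::memory_order value = std::memory_order_seq_cst; };
template <> struct to_std_order<acquire>    { static constexpr std::memory_order value = std::memory_order_acquire; };
template <> struct to_std_order<release>    { static constexpr std::memory_order value = std::memory_order_release; };
template <> struct to_std_order<relaxed>    { static constexpr std::memory_order value = std::memory_order_relaxed; };

template <typename T>
class atomic_with_semantics {
public:
    explicit atomic_with_semantics(T v = T()) : value_(v) {}

    // Default mirrors the existing "acquire" behaviour of operator T.
    template <memory_semantics M = acquire>
    T load() const {
        static_assert(M != release, "load-with-release is rejected at compile time");
        return value_.load(to_std_order<M>::value);
    }

    // Default mirrors the existing "release" behaviour of operator=.
    template <memory_semantics M = release>
    void store(T v) {
        static_assert(M != acquire, "store-with-acquire is rejected at compile time");
        value_.store(v, to_std_order<M>::value);
    }

    // Read-modify-write defaults to sequential consistency.
    template <memory_semantics M = full_fence>
    T fetch_and_add(T delta) {
        return value_.fetch_add(delta, to_std_order<M>::value);
    }

private:
    std::atomic<T> value_;
};

} // namespace sketch
```

With this shape, the "silly" combinations never reach run time: a.load<sketch::release>() simply does not compile.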

TBB 2.1 is pretty much frozen, so these changes will have to be made after that. I'm swamped on last-minute bug fixes for 2.1, so I have not been able to give this forum much time :-(

Another example of a non-TSO processor is Itanium. Ordinary Itanium stores are not TSO. Itanium st.rel stores are TSO.

RafSchietekat · ‎05-17-2008

Of course I couldn't help myself, and now I'm well along the way with an implementation that supports those extra memory semantics (and bitwise operators), even if only for use with TBB. It seems simpler to push memory semantics and templates down into the platform-related files: to port, implement some required template specialisations, plus as many others as may benefit performance.

Dmitriy, maybe I'll study Alexander Terekhov's list again later (no promise, though); I was too quick in brushing it aside this time: there's more information now than when I first saw it (I think), plus your opinion that it is worth considering.

A general question: What might not be evil about having function arguments (C++0x) instead of template arguments (TBB) for memory semantics? Shouldn't the compiler help check that they are fixed at compile time, and make programmers jump through hoops to deviate from that, and isn't that what template arguments enable the compiler to do? Dmitriy, would you agree that this is a considerable advantage with template arguments, of higher value than The Principle Of Least Surprise (which also depends on what your reference is)? Sometimes a language gives you more by not allowing you to do some things (in this case letting the semantics vary haphazardly). Here it sits between specifically named operations (C style, difficult to customise) and complete run-time laissez-faire. Oh yes, and could the compiler really get rid of function-argument overhead as compared to template arguments?

"If I was doing TBB from scratch, I might make the deaults sequentially consistent." I don't see why that would be such a good idea, as compared to education. Quite surprising, as well, coming from Intel. It is rather fool proof, I know, but with such a default, code will likely be overburdened with load and store until you won't be able to see the forest for the trees anymore, and any significant deviation from the pattern would be obscured instead of standing out, encouraging bugs again. There's another option that has not been considered yet: an atomic template argument that determines how the operations behave by default (relaxed for reference counters etc., default acquire/release for messaging, ordered for special applications); why just take over one-size-fits-all Java volatile semantics if C++ allows extra freedom (where it makes sense)?

"Not add the C++ 200x acq_rel semantics." Then I feel compelled to be the champion of the little guy (IBM)!

RafSchietekat · ‎05-21-2008

To my own general question: And if the choice is not required to be fixed at compile time, there can be no compile-time checking of that fixed value! Templated operations are definitely the way to go then.
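The contrast can be made concrete with a small sketch against C++11 std::atomic. The function names store_arg, store_tpl and store_maybe_relaxed are hypothetical, and the static_assert checks only two of the orderings that are invalid for a store.

```cpp
#include <atomic>
#include <cassert>

// Function-argument style: the ordering is an ordinary value and may be
// computed at run time, so the compiler cannot reject a bad choice up front.
inline void store_arg(std::atomic<int>& a, int v, std::memory_order order) {
    a.store(v, order);
}

// Template-argument style: the ordering is fixed at the call site and can be
// validated at compile time. (Only two invalid store orderings are checked.)
template <std::memory_order Order>
inline void store_tpl(std::atomic<int>& a, int v) {
    static_assert(Order != std::memory_order_acquire &&
                  Order != std::memory_order_acq_rel,
                  "acquire orderings are not valid for a store");
    a.store(v, Order);
}

// Hypothetical helper showing a run-time choice, possible only with store_arg.
inline void store_maybe_relaxed(std::atomic<int>& a, int v, bool fast_path) {
    std::memory_order order = fast_path ? std::memory_order_relaxed
                                        : std::memory_order_seq_cst;
    store_arg(a, v, order);
}
```

Writing store_tpl<std::memory_order_acquire>(a, 1) is a compile-time error, whereas store_arg(a, 1, std::memory_order_acquire) compiles and violates the precondition of std::atomic::store.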

Progress report: I think I'm just about finished for linux_ia32.h, except that I got a compile-time (!) error about compare_and_swap in queuing_mutex.cpp. One solution might be to assume success (with failure, a spurious StoreStore barrier might be issued, which might cause some unnecessary delay on non-TSO platforms and require some explanation on TSO platforms), another to implement double semantics for compare_and_swap (one for success, one for failure) as on C++0x (which has default failure semantics based on success semantics that are equivalent with assuming success except for potentially better performance). The easiest would be the first option, which requires just changing false->true in a handful of places and adding a comment, but it might be biased against non-TSO platforms (is there a real performance cost?), so any C++0x interest would then be biased against it.
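For reference, the "double semantics" option corresponds to the two-ordering form of compare-and-swap in C++11 std::atomic. The sketch below is an illustration only; try_claim and its parameters are hypothetical names, and the particular orderings are one reasonable choice, not taken from the patch being discussed.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: claim ownership if the slot currently holds free_value.
// Success uses acquire+release ordering; failure performs only a relaxed load,
// so no barrier is paid for when the swap does not happen.
inline bool try_claim(std::atomic<int>& owner, int free_value, int me) {
    int expected = free_value;
    return owner.compare_exchange_strong(expected, me,
                                         std::memory_order_acq_rel,   // on success
                                         std::memory_order_relaxed);  // on failure
}
```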

After resolving this issue (not a strict requirement), I should be able to do some final cleanup and contribute the solution (some work will/may then be required to re-port (most of) the other platforms).

(Added) Oh yes, there's still that idea of having configurable default semantics...

RafSchietekat · ‎05-21-2008

Here's something that might serve as the basis of an implementation with more memory semantics choices (and bitwise operators), based on tbb20_20080408oss_src (most recent stable release). It seems to work on Linux for IA-32, but a lot of tests need to be added to validate that (I just did the normal build all and neglected unit testing for now).

Template metaprogramming buffs might want to check the syntax (I just improvised a bit, it seems more like hacking than programming anyway).

I didn't go ahead with configurable default semantics because partial template specialisation and default template arguments don't seem to mix, not on my GCC anyway.
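
For what it's worth, the configurable-defaults idea might look roughly like the sketch below. It wraps C++11 std::atomic instead of TBB's own machine layer, so it avoids the partial-specialisation problem rather than solving it. Every name here (relaxed_defaults, messaging_defaults, ordered_defaults, role_atomic, demo_role_atomic) is hypothetical, and fetch_and_add assumes an integral T.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical policy types: each one names the default order per kind of operation.
struct relaxed_defaults {      // e.g. reference counters
    static constexpr std::memory_order load_order  = std::memory_order_relaxed;
    static constexpr std::memory_order store_order = std::memory_order_relaxed;
    static constexpr std::memory_order rmw_order   = std::memory_order_relaxed;
};
struct messaging_defaults {    // acquire/release for messaging
    static constexpr std::memory_order load_order  = std::memory_order_acquire;
    static constexpr std::memory_order store_order = std::memory_order_release;
    static constexpr std::memory_order rmw_order   = std::memory_order_acq_rel;
};
struct ordered_defaults {      // sequentially consistent for special applications
    static constexpr std::memory_order load_order  = std::memory_order_seq_cst;
    static constexpr std::memory_order store_order = std::memory_order_seq_cst;
    static constexpr std::memory_order rmw_order   = std::memory_order_seq_cst;
};

// Hypothetical wrapper: the role is stated once, where the variable is declared.
template <typename T, typename Defaults = messaging_defaults>
class role_atomic {
    std::atomic<T> value_;
public:
    explicit role_atomic(T v = T()) : value_(v) {}
    T load(std::memory_order m = Defaults::load_order) const { return value_.load(m); }
    void store(T v, std::memory_order m = Defaults::store_order) { value_.store(v, m); }
    // Returns the previous value, like TBB's fetch_and_add.
    T fetch_and_add(T d, std::memory_order m = Defaults::rmw_order) {
        return value_.fetch_add(d, m);
    }
};

// Single-threaded check of values only; it does not exercise the ordering itself.
inline void demo_role_atomic() {
    role_atomic<int, relaxed_defaults> refs(1);  // relaxed by default
    assert(refs.fetch_and_add(1) == 1);
    assert(refs.load() == 2);
    role_atomic<int> ready(0);                   // release/acquire by default
    ready.store(7);
    assert(ready.load() == 7);
}
```

A call site can still override the default explicitly, e.g. refs.load(std::memory_order_acquire), so a deviation from the declared role stands out.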

The biggest test is how well it can be ported to other architectures. Apparently the Intel compiler for Itanium doesn't even need compiler fences with volatile, but for others the code still needs a solution for not overloading the code with fences if the basic operation is already ordered (mac_ppc.h), etc.

I also contributed this in the official way to TBB.

Worthless? Wonderful? Somewhere in-between?

Dmitry_Vyukov · ‎05-21-2008

Sorry for late reply - I really don't have much time...

I think the question of "memory order as templates vs function parameters" is really more a matter of personal taste.
In either case compiler will eliminate all runtime overhead (100% for templates and 99% for parameters).
For the end user the two are really equivalent.
For the implementor they are almost equivalent. With parameters I can use "inheritance" of memory orders. For templates it's a bit harder.
In either case memory order must be fixed at compile time. For templates it's a bit easier to enforce.
Some sophisticated template metaprogramming... I don't see what it can give to implementor in this case.
"Bad" combinations of operation/memory_order (load-release, store-acquire) must be prohibited in either case. It's equally easy to implement with templates and parameters.
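
As a rough illustration of the template variant, the sketch below rejects the "bad" combinations at compile time. It is only a sketch under assumptions: checked_atomic and demo_checked_atomic are hypothetical names, the wrapper sits on top of std::atomic, and static_assert stands in for whatever mechanism a real implementation would use.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper: the memory order is a template argument, so it is
// necessarily fixed at compile time and can be checked there.
template <typename T>
class checked_atomic {
    std::atomic<T> value_;
public:
    explicit checked_atomic(T v = T()) : value_(v) {}

    template <std::memory_order M>
    T load() const {
        static_assert(M != std::memory_order_release &&
                      M != std::memory_order_acq_rel,
                      "a load cannot have release semantics");
        return value_.load(M);
    }

    template <std::memory_order M>
    void store(T v) {
        static_assert(M != std::memory_order_acquire &&
                      M != std::memory_order_consume &&
                      M != std::memory_order_acq_rel,
                      "a store cannot have acquire semantics");
        value_.store(v, M);
    }
};

// Single-threaded check of values only.
inline void demo_checked_atomic() {
    checked_atomic<int> counter(0);
    counter.store<std::memory_order_release>(5);
    assert(counter.load<std::memory_order_acquire>() == 5);
    // counter.load<std::memory_order_release>();  // would not compile
}
```

With the order as a function parameter, the same check needs the argument to be a compile-time constant at the call site, which is the part that is a bit harder to enforce.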

As for default parameters. I think that it's bad idea in this case. memory_order_seq_cst is the strongest order, so it's the only one that can be made the default. I really don't want it this way:

counter.store(value);

I want it this way:

// memory_order_release: synchronizes with acquire operation in function foo()
counter.store(value, std::memory_order_release);

Dmitriy V'jukov

RafSchietekat · ‎05-21-2008

(Removed)

"In either case compiler will eliminate all runtime overhead (100% for templates and 99% for parameters)." Really? Also in non-optimised builds? Also in inner loops or to implement spinning locks? I think there might well be a big difference between 100% and "99%".

"With parameters I can use "inheritance" of memory orders." But that seems only useful for *not* implementing some memory semantics, whereas I found it is better to just plug in policy objects (__TBB_fence_guard in my proposal) and only use delegation in other dimensions than memory semantics, e.g., going from store to looping on compare_and_swap.

"In either case memory order must be fixed at compile time. For templates it's a bit easier to enforce." Are you using inheritance graph design to test things at compile time (not test time)? And (how) do you enforce that the semantics are fixed at compile time?

"Some sophisticated template metaprogramming... I don't see what it can give to implementor in this case." I think template metaprogramming is somewhat disgusting, because the language was not designed for that (I found it to be a constant struggle), but with memory_semantics as an enum I don't see another way to test conditions at compile time; also see previous question.

"As for default parameters. I think that it's bad idea in this case." And I think that the default should be configurable, and documented where the atomic variable was declared (like you would do with a lock, right?): the code should follow from the role, not the other way around. Also there is what I wrote about not seeing the forest for the trees anymore if for better performance nearly every use was changed to release/acquire anyway. Ordered semantics are probably the exception, and I would not be surprised if ordered/seq_cst could be easily eliminated for use of stand-alone fence instructions (design-wise, not just technically feasible). Some people, and not the least of them, think that atomics are there primarily for their own sake (raw/relaxed/unfenced), so it seems a far stretch to take sequential consistency to be the default, except to achieve shortsighted fool-proofness (shortsighted because the code will degenerate to unsightliness before it is fully optimised, inviting bugs instead of discouraging them). Of course it is always possible that I just don't have enough experience with atomics for my intuition about these things to be reliable yet, but then it should be fairly easy to refute these arguments.

ARCH_R_Intel · ‎05-22-2008

There is the comment "// - ordered for special purposes". What is "ordered"?

There is the comment "// - rel_acq for "normal" release/acquire/rel_acq defaults (it would be the defaults' default)". What is "rel_acq"? Same as ISO C++ draft or something else?

- Arch