- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The discussion in http://software.intel.com/en-us/forums//topic/58670 raises the point that TBB's atomic operations are lacking:
- A way to express an unfenced atomic load
- A way to express a full fence (i.e. fence with both acquire and release semantics)
(2) can be expressed, pathetically, as a dummy atomic<T> read-modify-write such as fetch_and_add(0).
I'm wondering about opinions on syntax and generality for the additions suggested above. The simplest approach would be to add:
- T atomic<T>::unfenced_load() const
- void atomic<T>::fence() const
Another approach would be to add a templated T atomic<T>::load<memory_semantics>() const (and similarly for the other operations).
Would the "simplest approach" be enough?
- Arch
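A sketch of what the "simplest approach" might look like, using C++11 std::atomic as a stand-in for tbb::atomic<T> (the class name atomic_ext and all implementation details are assumptions for illustration, not TBB code):

```cpp
#include <atomic>

// Hypothetical sketch of the proposed additions, with C++11
// std::atomic<T> standing in for tbb::atomic<T>.
template <typename T>
class atomic_ext {
public:
    explicit atomic_ext(T v = T()) : rep_(v) {}

    // Existing behaviour: ordinary load with acquire semantics.
    T load() const { return rep_.load(std::memory_order_acquire); }

    // Proposed: T atomic<T>::unfenced_load() const -- no ordering at all.
    T unfenced_load() const { return rep_.load(std::memory_order_relaxed); }

    // Proposed: void atomic<T>::fence() -- full fence, acquire + release.
    static void fence() { std::atomic_thread_fence(std::memory_order_seq_cst); }

    // Existing behaviour: ordinary store with release semantics.
    void store(T v) { rep_.store(v, std::memory_order_release); }

private:
    std::atomic<T> rep_;
};
```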
A probably major difference of opinion: C++0x chooses a full fence as the default for load and store, because of one infamous example, whereas Intel seems totally committed to ordered loads on the one hand and ordered stores on the other, with no way to break free even if the programmer wants to (and a locked increment actually implying a full fence!), and I still have not convinced myself that I understand all the implications. It's actually a schizophrenic situation within the C++0x scene: they now have this full-fence default, even though there are probably some who consider raw/unfenced/relaxed to be the main usage model for atomics.
Should I contribute my current version now? It's still a work in progress, though, even if it currently builds just fine. My opinion is that the main objective should be to improve what C++0x will impose on us all, and time is short.
You are right that a full fence is not the same as "acquire;release" or "release;acquire". The latter can be reordered into the former, and the former allows "x; acquire; release; y" to be reordered as "acquire; x; y; release". I should have been more specific and said that a "full fence" is the same as atomic execution of acquire+release.
In my opinion, LFENCE and SFENCE are regrettable beasts from the Precambrian explosion of memory consistency models, before modern acquire/release models evolved.
From the viewpoint of the C++ 200x draft, Intel IA-32 and Intel 64 processors effectively have seq_cst semantics for all LOCK-prefixed operations (per the white paper), acquire semantics for ordinary loads, and release semantics for ordinary stores. Itanium has seq_cst for its fetch-and-add, exchange, and compare-and-swap.
For increment, you can break free with the templated memory-semantics argument of atomic<T>::fetch_and_add.
If we were to add support for fenceless variants, we could add a corresponding relaxed memory-semantics value.
I'm curious to see your proposal.
And std::memory_order_seq_cst is not just a full fence: it's a full fence plus a global operation order. So the default memory order in C++0x is stronger than a full fence.
About relaxed atomics. There is an extremely important case (besides statistics :) ): reference counting with basic thread safety:
void rc_acquire(rc_t* obj)
{
obj->counter.increment(memory_model_relaxed);
}
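For completeness, here is a hedged sketch of the full basic-thread-safety counting pattern in C++11 atomics; rc_release and the fence placement are the standard shared_ptr-style idiom, not code from this thread. The increment can be relaxed, but the decrement still needs release, paired with an acquire fence before destruction:

```cpp
#include <atomic>

// Hypothetical reference-counted object with 'basic' thread safety:
// a thread may copy a reference only if it already owns one.
struct rc_t {
    std::atomic<long> counter{1}; // the creating thread owns one reference
};

void rc_acquire(rc_t* obj) {
    // Relaxed is enough: ownership was established by whoever handed
    // this thread the existing reference, so nothing is being acquired.
    obj->counter.fetch_add(1, std::memory_order_relaxed);
}

// Returns true when the last reference was dropped (caller may delete).
bool rc_release(rc_t* obj) {
    // The decrement is a release, so this thread's writes to the object
    // become visible to whichever thread performs the final decrement.
    if (obj->counter.fetch_sub(1, std::memory_order_release) == 1) {
        // The acquire fence pairs with those releases before destruction.
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}
```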
Also Paul McKenney in rationale for memory model for C++0x provides following example:
for (size_t i = 0; i != mailbox_count; ++i)
{
if (msg_t* m = atomic_xchg(&mailbox[i].head, 0, memory_model_relaxed))
{
// execute a stand-alone acquire fence only here, when a message was received, not on every check
fence_acquire();
process(m);
}
}
This is also rationale for stand-alone fences.
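Restated runnably in C++11 atomics (msg_t, mailbox_t and poll_mailboxes are illustrative names, not McKenney's code), the point is that the acquire fence is paid only on the rare hit path:

```cpp
#include <atomic>
#include <cstddef>

// Illustrative types: a fixed array of mailboxes holding message lists.
struct msg_t { msg_t* next; int payload; };
struct mailbox_t { std::atomic<msg_t*> head{nullptr}; };

const std::size_t mailbox_count = 4;
mailbox_t mailbox[mailbox_count];
int processed = 0; // counts handled messages, for demonstration

void process(msg_t*) { ++processed; }

void poll_mailboxes() {
    for (std::size_t i = 0; i != mailbox_count; ++i) {
        // Relaxed exchange on every check: the common empty-mailbox
        // path pays for no memory fence at all.
        if (msg_t* m = mailbox[i].head.exchange(nullptr,
                                                std::memory_order_relaxed)) {
            // A stand-alone acquire fence only when a message was
            // actually taken, before its contents are read.
            std::atomic_thread_fence(std::memory_order_acquire);
            process(m);
        }
    }
}
```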
About default memory order in C++0x.
Well, yes, seq_cst is not intended for use in high-performance synchronization algorithms. Anyway, there is a rationale for making seq_cst the default. As I understand it, first you prototype the algorithm with seq_cst. When you see that it works with seq_cst, you optionally and selectively start replacing seq_cst with weaker fences at critical points (on the fast path). On the slow path you can leave seq_cst. And undoubtedly seq_cst is much easier to reason about.
You can see Paul McKenney's proposal for memory model for C++0x:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2237.pdf
He concludes:
"This is a prioritized list of differences of this model versus ISOMM:
1. Provide standalone memory fences. This is crucial to avoid the introduction of unnecessary
ordering constraints. Our proposal has been to introduce only three forms of ordering constraints,
but an alternative is to include all possible fences (LoadLoad, LoadStore, StoreLoad & StoreStore)
to allow exploitation on hardware architectures that provide such primitives.
2. Do not require ordering on atomic operations over what is specified by their ordering constraints.
In particular, allow full reordering of raw atomic operations, and allow load-acquire operations to
be reordered ahead of preceding store-release operations.
3. Allow atomic operations to be removed if they are found to be redundant based on sequential
program analysis.
4. Define a mechanism to enable ordering of dependent memory operations, both through control
flow or data flow dependencies.
5. Define a mechanism to allow ordering of atomic operations without introducing any hardware
primitives."
Dmitriy V'jukov
A full fence defined this way leaves out sequential consistency, strictly speaking (memory_model_acq_rel vs. memory_model_seq_cst), which may or may not become visible on some processors, though not on IA-32/Intel 64. Which processors might that be, and what are the performance considerations for making this (farfetched?) difference? Dmitriy?
Why regrettable, if they are hidden in locks and atomics? It seems a bigger problem that now IA-32/Intel 64 does not allow a way out to provide cheap raw/unfenced/relaxed atomic operations, especially involving locked operations, which now impose a full fence (also see below).
IA-32/Intel 64 also has seq_cst for XCHG (implicit lock signal), other than the more explicit serialisation instructions.
As long as any processors out there allow unfenced anything, it would seem presumptuous not to allow them to take advantage of it. I think that the statistics example may have been overused (it does seem like grasping at straws), because the bread-and-butter rationale would be reference-counted pointers (see another recent thread that commented on their slowness, which may or may not be caused by IA-32/Intel 64 imposing full-fence semantics?). Dmitriy also mentioned this, I see, and it seems a very strong reason.
The question is whether to allow all combinations orthogonally, or to impose any perceived wisdom on the user through undefined template specialisations. The challenge is to do this across all architectures (do they have incompatible sweet spots or can they agree?).
I've been tinkering with my code some more, but it seems that I don't have all the facts together yet (and the good weather is beckoning me to get outside), so most is still to be done. The end result should be simplicity, of course: the user gets some sensible default semantics, or chooses his own, and the code tries to get that result the cheapest way it can on the particular architecture.
That covers Arch's message. I wrote most of this before Dmitriy posted his message, so...
With a bit of elementary-particle physics, fences may exist in the following combinations: unfenced/acquire/release/release+acquire as one causality-related family, and unfenced/reload/flush/reload+flush as another, sequential-consistency-related family. In addition to what their counterparts in the first family do, a reload will not move up across a flush, which gets rid of relativity (mixing my metaphors), but I don't know yet if that is all that is different. Then there's the matter of what it means to have a device that unidirectionally prevents reordering of anything (does it make sense at all?), the meaning of LoadStore, the rationale for stand-alone fences, a lot of other things, and whether they are relevant or just there to confuse me... So I'm still wondering about such matters, and Dmitriy just sabotaged my attempt to limit my scope.
(Corrected) Removed some things in last paragraph.
Raf_Schietekat:
About Arch's message, paragraph by paragraph:
A full fence defined this way leaves out sequential consistency, strictly speaking (memory_model_acq_rel vs. memory_model_seq_cst), which may or may not become visible on some processors (which ones?), though not on IA-32/Intel 64. Which processors might that be, and what are the performance considerations for making this (farfetched?) difference? Dmitriy?
I'm not ready to answer about processors, but there are definitely some processors without total store order (I think SPARC when not in TSO mode, maybe PPC, maybe ARM).
A full fence without sequential consistency can lead to very counter-intuitive results. I think seq_cst was added mainly to provide clear and easy-to-reason-about semantics, and to provide a tool for prototyping.
See page 11 in presentation "The Future of Concurrency in C++":
http://www.justsoftwaresolutions.co.uk/files/future_of_concurrency.pdf
You can replace the release and acquire fences with acq_rel (full) fences; it will not change the output.
As you can see, the output is very counter-intuitive.
See page 12, where seq_cst is used; the result is intuitive.
Raf_Schietekat:
Why regrettable, if they are hidden in locks and atomics? It seems a bigger problem that now IA-32/Intel 64 does not allow a way out to provide cheap raw/unfenced/relaxed atomic operations, especially involving locked operations, which now impose a full fence (also see below).
Agree. But I don't think that we can change anything here :)
Raf_Schietekat:
The question is whether to allow all combinations orthogonally, or to impose any perceived wisdom on the user through undefined template specialisations. The challenge is to do this across all architectures (do they have incompatible sweet spots or can they agree?).
AFAIK, the current C++0x draft prohibits only store-acquire, store-acq_rel, load-release and load-acq_rel.
I don't know the rationale behind this.
Raf_Schietekat:
[...] fences may exist in the following combinations: [...] unfenced/reload/flush/reload+flush as another sequential consistency-related family.
Please describe semantics of unfenced/reload/flush/reload+flush in more detail.
Dmitriy V'jukov
Raf_Schietekat:
With a bit of elementary-particle physics, fences may exist in the following combinations: unfenced/acquire/release/release+acquire as one causality-related family, and unfenced/reload/flush/reload+flush as another, sequential-consistency-related family. In addition to what their counterparts in the first family do, a reload will not move up across a flush, which gets rid of relativity (mixing my metaphors), but I don't know yet if that is all that is different. Then there's the matter of what it means to have a device that unidirectionally prevents reordering of anything (does it make sense at all?), the meaning of LoadStore, the rationale for stand-alone fences, a lot of other things, and whether they are relevant or just there to confuse me... So I'm still wondering about such matters, and Dmitriy just sabotaged my attempt to limit my scope.
For now I stick to the following approach.
First of all, whenever possible I use C++0x interfaces, semantics and names; for example, "relaxed" rather than unfenced/naked. It seems that for now C++0x is sufficiently stable with respect to the memory model and atomics. Current C++0x draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2588.pdf
In my personal opinion, C++0x provides too restrictive semantics. Luckily it's not only my opinion, so I also take as a basis the proposals for the C++0x memory model/atomics from Peter Dimov, Alex Terekhov and Paul McKenney:
http://groups.google.com/group/comp.programming.threads/msg/b43cd6c9411c95b9
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2237.pdf
I hope they thought enough before making their proposals, so I don't have to think :)
Initially I was going to implement atomics with templates:
atomic_store<memory_model_something>(&x, 1);
I was thinking that this was the only efficient method, and that the following approach would inevitably add unnecessary overhead:
atomic_store(&x, 1, memory_model_something);
Finally I figured out how to implement atomics with the C++0x interface (atomic_store(&x, 1, memory_model_something)) in an efficient and comfortable way. Here is a code sketch:
/*** Pay attention to inheritance ***/
// fake root
struct memory_order {};
/***** full fences *****/
// full fence + total order
struct mo_seq_cst_t : memory_order {};
// full fence
struct mo_acq_rel_t : mo_seq_cst_t {};
/***** release fences *****/
// classic release
struct mo_release_t : mo_acq_rel_t {};
// release not affecting stores
struct mo_release_load_t : mo_release_t {};
// release not affecting loads
struct mo_release_store_t : mo_release_t {};
/***** relaxed fences *****/
// does not order memory
struct mo_relaxed_t : mo_acq_rel_t {};
extern mo_seq_cst_t memory_order_seq_cst;
extern mo_acq_rel_t memory_order_acq_rel;
extern mo_release_t memory_order_release;
extern mo_release_load_t memory_order_release_load;
extern mo_release_store_t memory_order_release_store;
/*** Implementation for MSVC/x86 ***/
// Only store - other operations stripped
class atomic32
{
public:
typedef unsigned value_type;
explicit atomic32(value_type v = value_type())
: value_(v)
{
}
void store(value_type v, memory_order = memory_order_seq_cst) volatile
{
_InterlockedExchange((long*)&value_, v);
}
void store(value_type v, mo_release_t) volatile
{
_ReadWriteBarrier();
value_ = v;
}
void store(value_type v, mo_relaxed_t) volatile
{
value_ = v;
}
private:
value_type volatile value_;
atomic32(atomic32 const&);
atomic32& operator = (atomic32 const&);
/*** forbidden ***/
void store(value_type v, mo_acq_rel_t) volatile;
};
Basic rule: I always provide an implementation for seq_cst. Then I provide specialized implementations if they can be implemented more efficiently on the current architecture (release). Then I move forbidden combinations to the private section (acq_rel).
All other things are handled by inheritance, i.e. store with mo_acquire_t is also forbidden, and store with mo_release_store_t uses the implementation of store with mo_release_t.
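The dispatch can be seen in a stripped-down, runnable form (illustrative only; the which_t return value just records which overload was chosen): a call with mo_release_store_t binds to the mo_release_t overload because that is its closest implemented base, while an order with no cheaper overload falls through to the seq_cst catch-all.

```cpp
// Stripped-down reconstruction of the dispatch trick: tag types inherit
// from the next stronger order, so an order without its own overload
// falls back to the closest implemented base.
struct memory_order_t {};
struct mo_seq_cst_t : memory_order_t {};
struct mo_acq_rel_t : mo_seq_cst_t {};
struct mo_release_t : mo_acq_rel_t {};
struct mo_release_store_t : mo_release_t {};
struct mo_relaxed_t : mo_acq_rel_t {};

enum which_t { SEQ_CST, RELEASE, RELAXED };

struct atomic32 {
    which_t store(memory_order_t) { return SEQ_CST; } // catch-all default
    which_t store(mo_release_t)   { return RELEASE; } // cheaper on x86
    which_t store(mo_relaxed_t)   { return RELAXED; } // plain store
};
```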
Also I add a bunch of compiler fences (affects only code generation):
/***** compiler fences *****/
struct co_acq_rel_t : mo_acq_rel_t {};
struct co_acquire_t : co_acq_rel_t {};
struct co_release_t : co_acq_rel_t {};
I'm not sure about bidirectional fences (store-store (sfence), load-load (lfence) ). For now I just comment them out.
What do you think?
Dmitriy V'jukov
Oh boy, Alexander Terekhov's list again... I tried to ignore that for lack of a legend (what do the entries all mean?), so I obviously don't know whether it's useful at all. So now it really is all back on the table. One problem is that too much detail might be too confusing and lead to bugs.
What is the context for your atomics proposal?
I dislike the C++0x proposal because it looks too much like plain C, and I remember how an active desire (driven by IBM, I think) to be compatible with plain C ruined another standard, the C++ CORBA mapping (which otherwise might have had things like a sequence template class, and not have suffered debilitating leaks by using something like auto_ptr sources and sinks instead of plain pointers). Here it's not as problematic, but it just looks ugly, and elegance is a real goal, and I would hate it if the current C++0x atomics proposal goes to press like this (it's my civic duty to oppose it).
Maybe you don't need to use the memory-order argument in a switch or anything, but it will still be part of the call (right?), whereas a template argument is completely invisible. Is there any reason not to use a templated function like TBB does, where template specialisation could do the same as having different overloads? There's a whole lot of things that can be done with template metaprogramming... I was thinking of having architecture-specific traits for the operations, and using those to select appropriate "packaging" implementations for the different memory semantics.
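For comparison, a minimal sketch of the template-argument style described here, mapping a TBB-like memory_semantics enum onto C++11 orders (my_atomic and order_of are made-up names; TBB's real implementation differs). The semantics value is a compile-time template parameter, so it is completely invisible in the call's argument list:

```cpp
#include <atomic>

// TBB-like compile-time memory semantics as a template argument.
enum memory_semantics { full_fence, acquire, release, relaxed };

// Map each semantics value onto a C++11 memory order via specialization.
template <memory_semantics M> struct order_of;
template <> struct order_of<full_fence> {
    static constexpr std::memory_order value = std::memory_order_seq_cst;
};
template <> struct order_of<acquire> {
    static constexpr std::memory_order value = std::memory_order_acquire;
};
template <> struct order_of<release> {
    static constexpr std::memory_order value = std::memory_order_release;
};
template <> struct order_of<relaxed> {
    static constexpr std::memory_order value = std::memory_order_relaxed;
};

template <typename T>
struct my_atomic {
    std::atomic<T> rep{T()};
    template <memory_semantics M> T load() const {
        return rep.load(order_of<M>::value);
    }
    template <memory_semantics M> void store(T v) {
        rep.store(v, order_of<M>::value);
    }
};
```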
MADadrobiso:
With respect to the reference-counting example, my impression is that it is typically unsafe to leave out the fence, because the increment of the reference count typically implies that a thread is acquiring rights to read a shared reference-counted object.
Provided so-called 'basic thread-safety' (a thread is permitted to acquire a reference to an object *only* if it already has one; a very common case, e.g. boost::shared_ptr), it's OK to remove the fence, because the thread isn't actually acquiring anything.
Provided so-called 'strong thread-safety' (a thread is permitted to acquire a reference to an object even if it doesn't have one; a very uncommon case), it is unsafe to remove the fence, because the thread is actually acquiring access to the object.
Dmitriy V'jukov
randomizer:
Provided so-called 'basic thread-safety' (a thread is permitted to acquire a reference to an object *only* if it already has one; a very common case, e.g. boost::shared_ptr), it's OK to remove the fence, because the thread isn't actually acquiring anything.
In this case the mechanism which transfers the object between threads (e.g. a producer-consumer queue) must execute the acquire fence, because that mechanism is what gives the thread access to the object.
Dmitriy V'jukov
Raf_Schietekat:
That p. 11 example is not to the point for illustrating acq_rel, I would say.
It illustrates the difference between acq_rel and seq_cst.
Raf_Schietekat:
As for allowing all combinations orthogonally vs. imposing the designer's wisdom I might be convinced by the argument that sometimes a language gives you more by the things it does not allow you to do (but I can't say now where I've heard that before).
I think you are right. This will eliminate at least a bit of complexity.
But I am thinking about special cases. For example sequence lock requires load with memory_model_release_wrt_loads (or msync::slb in Terekhov's list).
Raf_Schietekat:
In the second family I mentioned, reload would mean "forget any cached reads" (not actually reload, but the best word I could find was what it would imply later on), and flush would mean "write all dirty entries to memory now and wait until finished"; I don't know how useful flush and reload might be by themselves, though.
It looks quite unusual and confusing...
Raf_Schietekat:
Oh boy, Alexander Terekhov's list again... I tried to ignore that for lack of a legend (what do the entries all mean?)
There is a legend. Imho the semantics are clear...
Raf_Schietekat:
, so I obviously don't know whether it's useful at all. So now it really is all back on the table. One problem is that too much detail might be too confusing and lead to bugs.
I think, that it's better to dive into as much detail as possible, and then try to climb as high as possible. It's the only way to choose the right altitude.
Raf_Schietekat:
What is the context for your atomics proposal?
None. It's just my library. Similar to TBB.
Raf_Schietekat:
I dislike the C++0x proposal because it looks too much like plain C, and I remember how an active desire (driven by IBM, I think) to be compatible with plain C ruined another standard, the C++ CORBA mapping (which otherwise might have had things like a sequence template class, and not have suffered debilitating leaks by using something like auto_ptr sources and sinks instead of plain pointers). Here it's not as problematic, but it just looks ugly, and elegance is a real goal, and I would hate it if the current C++0x atomics proposal goes to press like this (it's my civic duty to oppose it).
Maybe you don't need to use the memory-order argument in a switch or anything, but it will still be part of the call (right?), whereas a template argument is completely invisible. Is there any reason not to use a templated function like TBB does, where template specialisation could do the same as having different overloads? There's a whole lot of things that can be done with template metaprogramming... I was thinking of having architecture-specific traits for the operations, and using those to select appropriate "packaging" implementations for the different memory semantics.
First of all, I'm not a C fan either. I like templates, template metaprogramming, etc.
The main reason to make the memory model a parameter is compliance with C++0x. It's the principle of least surprise: it doesn't matter whether it's better or worse, it's what the user expects and knows.
At present I don't see anything substantial that templates can give here. Handling the "inheritance" between fence types would be more complicated with templates.
Please elaborate about "things that can be done with template metaprogramming" wrt atomics/fence types. Can you provide some examples?
Dmitriy V'jukov
I don't see what's so confusing about something that might be a bit unusual (yes, I made it up myself).
I don't understand Alexander Terekhov's list. There might be a legend, but is it self-contained, with examples etc.? Or is it just me? I seriously doubt that! I think I've worked hard enough trying to understand this stuff, more than most people, and this makes no sense to me, all by itself.
Yes, get the details, but then there's a standard to be made that's not a recipe for bugs, and that starts with some level of understandability.
Well, at this point I'm not sure I want to go ahead with proposing an atomics library alternative. For starters, I'm unsure about the real cost of these arguments (can you give some insight into both the C++0x proposal, where I can imagine that thorough optimisation with constant propagation and dead-code elimination will do the trick, and yours, where I just don't know?), which is the only decisive argument for changing things this late in the game.
And if you accept that the memory semantics are passed as an argument, then there's a subset (the atomic template) that actually makes sense, and is a superset of TBB's atomic, which is currently lacking in functionality (bitwise operators and memory semantics) and which will soon be a squashed mosquito on the windshield of the new C++ standard. It's just that there's a whole load of redundant rubbish API thrown on top, and, given the choice, some programmers will undoubtedly use it.
As for the use of templates: all complexity should be hidden as much as possible from the normal user, and even from the porting user, of course, but it will hardly be more complicated than what TBB has now. But I have not yet started doing anything remotely sophisticated on this, and I think I might just play around with it a bit without committing, or even seriously considering, to produce a result. What's the use, anyway: even if I can show something really elegant (which is still a big question), how could it compete? Even if I'm not setting myself up for failure, I would be setting myself up for disappointment, wouldn't I?
My inclination for TBB is:
- Make the default for atomic read-modify-write operations the equivalent of C++ 200x's seq_cst. For Intel architectures, this is just a matter of changing the documentation, not the implementation.
- Make atomic<T>::operator= retain its "release" semantics and atomic<T>::operator T retain its "acquire" semantics. As with Raf's class, we can add new member functions "load" and "store" to deal with the other cases. If I were doing TBB from scratch, I might make the defaults sequentially consistent, but changing atomic<T>::operator= to sequential consistency now risks a huge performance impact on existing code.
- Not add the C++ 200x acq_rel semantics. They seem to offer little gain on most processors, and many hazards. My impression is that acq_rel would not even exist in C++ 200x except for the sake of the Power architecture.
- Add the C++ 200x relaxed option. This discussion has convinced me there is sufficient justification, even if there is not currently an efficient way to implement it on Intel architectures.
So we would end up with four options: sequentially consistent, acquire, release, and relaxed.
TBB 2.1 is pretty much frozen, so these changes will have to be made after that. I'm swamped with last-minute bug fixes for 2.1, so I have not been able to give this forum much time :-(
Another example of a non-TSO processor is Itanium. Ordinary Itanium stores are not TSO. Itanium st.rel stores are TSO.
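This inclination could look roughly like the following (an assumed shape using C++11 atomics, not actual TBB source; the class and member names are made up): operator= and operator T keep their current semantics, and new members cover the other cases.

```cpp
#include <atomic>

// Assumed shape only, not TBB source: operator= keeps release,
// operator T keeps acquire, and new members expose other orderings.
template <typename T>
class tbb_like_atomic {
public:
    tbb_like_atomic(T v = T()) : rep_(v) {}

    // Unchanged for compatibility with existing TBB code.
    T operator=(T v) {
        rep_.store(v, std::memory_order_release);
        return v;
    }
    operator T() const { return rep_.load(std::memory_order_acquire); }

    // New members covering the remaining cases.
    void store_seq_cst(T v) { rep_.store(v, std::memory_order_seq_cst); }
    T load_relaxed() const { return rep_.load(std::memory_order_relaxed); }

private:
    std::atomic<T> rep_;
};
```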
Dmitriy, maybe I'll study Alexander Terekhov's list again later (no promise, though); I was too quick in brushing it aside this time: there's more information now than when I first saw it (I think), plus your opinion that it is worth considering.
A general question: What might not be evil about having function arguments (C++0x) instead of template arguments (TBB) for memory semantics? Shouldn't the compiler help check that they are fixed at compile time, and make programmers jump through hoops to deviate from that, and isn't that what template arguments enable the compiler to do? Dmitriy, would you agree that this is a considerable advantage with template arguments, of higher value than The Principle Of Least Surprise (which also depends on what your reference is)? Sometimes a language gives you more by not allowing you to do some things (in this case letting the semantics vary haphazardly). Here it sits between specifically named operations (C style, difficult to customise) and complete run-time laissez-faire. Oh yes, and could the compiler really get rid of function-argument overhead as compared to template arguments?
"If I was doing TBB from scratch, I might make the de
"Not add the C++ 200x acq_rel semantics." Then I feel compelled to be the champion of the little guy (IBM)!
Progress report: I think I'm just about finished with linux_ia32.h, except that I got a compile-time (!) error about compare_and_swap.
After resolving this issue (not a strict requirement), I should be able to do some final cleanup and contribute the solution (some work will/may then be required to re-port (most of) the other platforms).
(Added) Oh yes, there's still that idea of having configurable default semantics...
Template metaprogramming buffs might want to check the syntax (I just improvised a bit, it seems more like hacking than programming anyway).
I didn't go ahead with configurable default semantics because partial template specialisation and default template arguments don't seem to mix, not on my GCC anyway.
The biggest test is how well it can be ported to other architectures. Apparently the Intel compiler for Itanium doesn't even need compiler fences with volatile, but for others the code still needs a solution for not overloading the code with fences if the basic operation is already ordered (mac_ppc.h), etc.
I also contributed this in the official way to TBB.
Worthless? Wonderful? Somewhere in-between?
I think the question of "memory order as template argument vs. function parameter" is actually more a matter of personal taste.
In either case the compiler will eliminate all runtime overhead (100% for templates and 99% for parameters).
For the end user they are effectively equal.
For the implementor they are almost equal. With parameters I can use "inheritance" of memory orders; with templates it's a bit harder.
In either case the memory order must be fixed at compile time. With templates it's a bit easier to enforce.
Some sophisticated template metaprogramming... I don't see what it can give the implementor in this case.
"Bad" combinations of operation and memory order (load-release, store-acquire) must be prohibited in either case. It's equally easy to implement with templates and with parameters.
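With template arguments, one way to prohibit the bad combinations at compile time is a static_assert in the operation (shown with C++11's static_assert as shorthand for the era's undefined-specialization trick; checked_atomic and its enum are illustrative names, not library code):

```cpp
#include <atomic>

enum memory_semantics { full_fence, acquire, release, relaxed };

// Illustrative: static_assert rejects load-release and store-acquire
// at compile time; any other combination maps to a C++11 order.
template <typename T>
struct checked_atomic {
    std::atomic<T> rep{T()};

    template <memory_semantics M> T load() const {
        static_assert(M != release, "load-release is prohibited");
        return rep.load(M == relaxed ? std::memory_order_relaxed
                      : M == acquire ? std::memory_order_acquire
                                     : std::memory_order_seq_cst);
    }
    template <memory_semantics M> void store(T v) {
        static_assert(M != acquire, "store-acquire is prohibited");
        rep.store(v, M == relaxed ? std::memory_order_relaxed
                   : M == release ? std::memory_order_release
                                  : std::memory_order_seq_cst);
    }
};
```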
As for default parameters, I think it's a bad idea in this case. memory_order_seq_cst is the strongest order, so it's the only one that could be made the default. I really don't want it this way:
counter.store(value);
I want it this way:
// memory_order_release: synchronizes with acquire operation in function foo()
counter.store(value, std::memory_order_release);
Dmitriy V'jukov
"In either case compiler will eliminate all runtime overhead (100% for templates and 99% for parameters)." Really? Also in non-optimised builds? Also in inner loops or to implement spinning locks? I think there might well be a big difference between 100% and "99%".
"With parameters I can use "inheritance" of memory orders." But that seems only useful for *not* implementing some memory semantics, whereas I found it is better to just plug in policy objects (__TBB_fence_guard in my proposal) and only use delegation in other dimensions than memory semantics, e.g., going from store
"In either case memory order must be fixed at compile time. For templates it's a bit easier to enforce." Are you using inheritance graph design to test things at compile time (not test time)? And (how) do you enforce that the semantics are fixed at compile time?
"Some sophisticated template metaprogramming... I don't see what it can give to implementor in this case." I think template metaprogramming is somewhat disgusting, because the language was not designed for that (I found it to be a constant struggle), but with memory_semantics as an enum I don't see another way to test conditions at compile time; also see previous question.
"As for default parameters. I think that it's bad idea in this case." And I think that the default should be configurable, and documented where the atomic variable was declared (like you would do with a lock, right?): the code should follow from the role, not the other way around. Also there is what I wrote about not seeing the forest for the trees anymore if for better performance nearly every use was changed to release/acquire anyway. Ordered semantics are probably the exception, and I would not be surprised if ordered/seq_cst could be easily eliminated for use of stand-alone fence instructions (design-wise, not just technically feasible). Some people, and not the least of them, think that atomics are there primarily for their own sake (raw/relaxed/unfenced), so it seems a far stretch to take sequential consistency to be the default, except to achieve shortsighted fool-proofness (shortsighted because the code will degenerate to unsightliness before it is fully optimised, inviting bugs instead of discouraging them). Of course it is always possible that I just don't have enough experience with atomics for my intuition about these things to be reliable yet, but then it should be fairly easy to refute these arguments.
There is the comment "// - ordered for special purposes". What is "ordered"?
There is the comment "// - rel_acq for "normal" release/acquire/rel_acq defaults (it would be the defaults' default)". What is "rel_acq"? Same as ISO C++ draft or something else?
- Arch
