Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Pluggable default memory semantics for atomics

RafSchietekat
Valued Contributor III
597 Views

I am of the strong opinion that C++11 goofed when it decreed sequential consistency as the default memory semantics for all atomic operations. It is an excessive general penalty (either performance cost or coding overhead) just to prevent the occasional mishap (relatively few uses need sequential consistency at the level of the atomic operation).

TBB defaults are far more reasonable, although I still have my doubts about sequential consistency for read-modify-write operations, which seems to be inspired by Intel's own Architecture (x86) more than by typical needs.

However, neither supports the widespread need for shared variables that don't need memory semantics, like flags and diagnostic counters. Such usage often doesn't even need temporal atomicity (read-modify-write), just spatial atomicity (no tearing, i.e., all bits transferred at the same time), if even that (for bool, the temptation is huge to assume that it is just a byte, right?). So programmers of all origins (even parallel programming toolkit development teams) will happily slap a "volatile" label on such a variable and declare the code fit for shipping.

I think that this is unfortunate, because by and large it assumes suitable alignment for spatial atomicity (often provided, but then why do we even need alignment declarations?), there are presumably (although I'm not sure) wasted fences to keep volatile accesses in the same thread out of each other's hair (in that sense "volatile" would be stronger than a relaxed atomic), and it would probably confuse analysis of the code by automatic tools (not verified).

The section "Multi-threaded executions and data races" has no exemption for volatile objects, so they are still potentially subject to data races, resulting in undefined behaviour. I think it's no excuse that we all know that the program will not cause your computer to spontaneously combust, and that "volatile" has been a trusted workaround for the lack of atomics all these years before C++11. It's still wrong.

So what's the solution? I think that "atomic" should support pluggable default memory semantics, which is relatively easy to retrofit on tbb::atomic at least. You should be able to choose between sequentially consistent (like C++11), traditional TBB defaults (see Reference Manual), the same but only release-acquire for read-modify-write operations (at this time only conceptually different from a full fence), relaxed, and explicit (where no defaults are provided).

With such an enriched "atomic" type, it's trivial to replace, e.g., "volatile int" with "tbb::atomic<int, tbb::default_memory_semantics::relaxed>", which can be typedef'ed for convenience. All uses can remain unchanged, except for increment and decrement, where the choice is between relief that it now finally works as intended and optionally breaking up the operation for performance's sake.

I'm eager to hear your comments about this...

 

0 Kudos
14 Replies
RafSchietekat
Valued Contributor III

Here's a possible implementation. It builds on an up-to-date Mac OS X setup, and I think it works as intended (except for swap).

If you want to test it yourself, just unzip and merge with a full copy of the source files. You now have an optional second template argument for tbb::atomic, e.g., tbb::default_memory_semantics::relaxed (choices are full_fence, tbb, rel_acq, relaxed, disabled, as described in previous posting). All operations and operators should now use the new default memory semantics, unless explicitly overridden (as before).

For example, with rel_acq, a read-modify-write operation/operator will first release, perform the operation/operator, and then acquire. For a store, it will only release, and for a load, it will only acquire, of course. As it happens, this is just like TBB's defaults, because rel_acq is currently conservatively approximated as a full fence, but your code will be self-documenting and future-proof (on architectures with weaker implicit memory semantics)!

Have a look at default_memory_semantics for the other choices if there is any doubt about what they do.

And yes, I also think that C++11 goofed by calling it acq_rel, because that's the opposite of what's supposed to happen.

(2014-10-04 Edited) Added the word "optional" above, in case this wasn't obvious: by default you get the traditional behaviour, and existing code should not need to be rebuilt.

RafSchietekat
Valued Contributor III

Elevator pitch:

With a (shared) shorthand declaration

typedef atomic<bool, default_memory_semantics::relaxed> SharedFlag;

you can replace the questionable definition

volatile bool flag = false;

with

SharedFlag flag;

Done!

RafSchietekat
Valued Contributor III

Of course you could also go the other way, with default_memory_semantics::full_fence, for those (rare) situations where you have a lot of operations/operators that should have full fences around them, or maybe because you're just not sure what to do with some strange algorithm you found in a document that assumed C++11 atomics or didn't bother to spell out the appropriate memory semantics for "other" reasons. But the main use case would probably still be to either somewhat or fully relax an atomic (to "rel_acq" resp. "relaxed"), or to be confident to have added explicit memory semantics everywhere (using "disabled").

Warning: speculation ahead!!!

However... there's another possible goal here: maybe, instead of just replacing the defaults, you might want to override even the explicit memory semantics, because there might be a mistake in how they were applied, and you want to test whether using full fences everywhere cures the problem? Or vice versa, you have a suspicion that memory semantics were set too conservatively, and you want to do a quick test to see what's what. Of course, that's actually 2 or 3 possible applications: set a (conservative) lower limit, do an exact override, or set an (exploratory) upper limit (not clear whether this is useful in addition to an exact override).

I tend to want to consider all possible angles before making a change, and maybe this overriding of explicit memory semantics is a bit of a stretch compared to the undoubtedly useful replacement of defaults. Still, it's probably doable enough to modify atomic.h to accept an override strategy as an alternative optional second template argument. There's also no urgency: the change to replace the defaults can be performed right now, and the change to override everything could be retrofitted later on. Considering it now does indicate, however, that the idea of using a traits implementation, as partially shown by some disabled code in my proposed implementation above, is likely to run into trouble and should probably not be pursued further; that seems like a useful outcome from a few moments of thinking ahead.

RafSchietekat
Valued Contributor III

Ramblings of a technical nature...

value_type is not inherited because it is a dependent type (or somesuch reasoning). It can either be defined over and over, or one could just provide a private typedef "super" (already used for __TBB_ATOMIC_CONSTRUCTORS boilerplate) and then have the boilerplate "using typename super::value_type;". Otherwise value_type can probably be eliminated from the inherited types, and T used directly (for brevity). I would get rid of I in any case and just use T instead (the affected comment can easily refer to "referent type").

There is a remark about providing an atomic 64-bit constructor even in a 32-bit environment. But it seems inappropriate to single out this width: the constructor is always unsafe, but this is no different from the constructor of, e.g., std::shared_ptr, whose instances are not thread-safe. More specifically, you may not take a reference to a shared_ptr instance and modify it from one thread while accessing it from one or more other threads at the same time.

The same goes for the above-mentioned copy construction, which is only allowed from an instance that is not concurrently being modified. Of course, after a safe copy has been made, it is possible to safely use the shared_ptr instance (where any thread that wants to modify the instance has to have exclusive access), even as other threads might be adding or dropping references. It seems appropriate to apply similar logic to tbb::atomic: copying must be done safely (identically to shared_ptr), and only then can the atomic be changed concurrently (unlike shared_ptr instances, but that's what an atomic is for).

I haven't checked what C++11 does, but of course it has the benefit of new constructor-related functionality that might be of use here, and if tbb::atomic were to provide similar functionality it would have to be restricted to a C++11 environment. Unfortunately, without C++11, there's no way to disable the implicit copy constructor without also disabling zero initialization at translation time, so it seems perhaps a bit risky to allow such use until pre-C++11 environments are only a distant memory, and the documentation should probably explicitly mention that even with C++11 it is still not allowed.

As a separate issue, there's no reason not to introduce fetch_and_{and,or,xor}. There's even an opportunity to do one better than C++11, by also having versions without the fetch, which can make use of efficient locked operations on x86, although providing such an implementation would take some effort. But at least the user would already be able to write the code.

jimdempseyatthecove
Honored Contributor III

It appears you are ready for TSX and/or RTM to come to the rescue.

>>As a separate issue, there's no reason not to introduce fetch_and_{and,or,xor}.

Agreed

>>also having versions without the fetch

Seeing that the (locked) fetch occurs in any event, its only benefit would be in relieving register pressure when the fetch value is not required afterwards (e.g. bit table of done flags). There would be no difference in performance.

Jim Dempsey

RafSchietekat
Valued Contributor III

Jim, I'm not sure whether there's a significant performance difference between LOCK ADD and LOCK XADD, and indeed that's not a high priority, but it is the only operation in the set {add, sub, and, or, xor} that has such an exchange version. Oh yes, I forgot that there's also no fetch_and_sub() at the moment, just the decrement by 1, but anything to do with subtraction is easily transformed into addition at the level of C++.

However, for those proposed logical operations there's no such solution, necessitating a CAS loop, and that seems unnecessarily expensive against a LOCK'ed version (as in __TBB_AtomicAND() and __TBB_AtomicOR()) if the previous value is not needed. It therefore seems more serious to replace such LOCK'ed versions with CAS loops than to not (immediately) provide "LOCK ADD" or "LOCK SUB" in addition to the current "LOCK XADD", the goal being to replace ad-hoc functions with tbb::atomic wherever possible (clean code, eat your own dog food, ...).

That's why I would propose to have separate non-fetch logical operations (and for orthogonality the same for the arithmetic operations).

 

jimdempseyatthecove
Honored Contributor III

Raf,

On IA32 and Intel64 you can have a memory address as destination for AND, OR, XOR. Therefore if you do not need the return value, the LOCK variation can be used (the flags can be used). For single bit manipulation you have the various BTx (bit test and do x) operations.

>>I would propose to have separate non-fetch logical operations

IOW

void operator|=(T rhs) {...}

as well as

T operator|=(T rhs) {...}

RafSchietekat
Valued Contributor III

Indeed, only addition has an exchange version in addition to a non-exchange version (no pun intended). tbb::atomic uses LOCK XADD (exchange version), while __TBB_AtomicAND() and __TBB_AtomicOR() respectively use LOCK AND and LOCK OR (non-exchange versions).

The operator is a difficult matter: it has no name to differentiate between returning a value or not (unfortunately C++ does not provide that information from the way it is used), and, while it would be natural to provide a return value (like the built-in operator), that also requires the more expensive implementation (which in many uses would go to waste). Maybe it would be easiest to have Intel complete the ISA with those missing instructions. :-)

But this will take some studying to come up with an elegant design, and for now I'd rather concentrate on the original topic of this forum thread.

Using the idea in #5 I've simplified my proposal somewhat to, e.g.,

//! Specialization for atomic<T*> with arithmetic and operator->.
template<typename T, typename MD>
struct atomic<T*,MD>: internal::atomic_impl_with_arithmetic<T*,ptrdiff_t,T,MD> {
    private:
        typedef internal::atomic_impl_with_arithmetic<T*,ptrdiff_t,T,MD> super;
    __TBB_ATOMIC_BOILERPLATE
    public:
        T* operator->() const {
            return this->load();
        }
};

 

jimdempseyatthecove
Honored Contributor III

I am not sure I fully understand your solution, much less the problem statement.

In the #9 proposal, it appears that you want to make the pointer atomic as opposed to the pointee (or am I missing something?).

Assuming you want the pointer to be obtained for use with an atomic operation (assuming no aliasing), what I found handy is this sketch code:

T* pSomewhere;

pSomewhere = getPointerAndLock(&someResource);

do your protected thing here.

someResource = pSomewhere; // unlock

Essentially, getPointerAndLock performs an XCHG or CAS of a memory location with a value that is interpreted as a locked pointer. The function returns when you have exclusive access (or potentially returns NULL if the original pointer was NULL). Effectively the pointer serves as a MUTEX. Now you may want variations of the function (exclusive use, write use with read permissions, etc...). Note, this is a lighter-weight construct than reference-counted pointers.

Jim

RafSchietekat
Valued Contributor III

Let me try to make that abundantly clear:

The "problem statement" is in the title of this thread ("pluggable" refers to supplying a struct listing those semantics at declaration time, not "configurable" for all atomics at the same time), with tbb::default_memory_semantics::tbb as the default class template parameter obtaining exactly the existing behaviour. The reference to new operations was explicitly called a "separate issue".

In #9, "operator" first referred to such new "bitwise logical operators" (as I should have called them). Then I got back to the original subject and chose what happens to be the only example with some additional stuff, a "->" operator (what is it officially called?), after __TBB_ATOMIC_BOILERPLATE.

(BTW, with full disclosure: introduction of those macros was prompted by some mistakes I had made that I noticed during review, so I tried to make things simpler (less prone to error, more obviously correct, more compact, ...). As it happens, the only other thing ultimately needed was a declaration of "super" (the type cannot be passed directly to a macro anyway, because of some commas). But the introduction of those macros is not essential to my proposal and can easily be rolled back; I just happen to prefer code-time certainty over test-time debuggability of imagined remaining problems, and I don't understand why people want to self-inflict the trouble of not using automatic code generation...)

I hope that also allows you to "fully understand [my] solution"?

The technique mentioned in #10 would be entirely orthogonal to any of that: a pointer atomic does not, by itself, do anything for the referent, nor should it.

RafSchietekat
Valued Contributor III

Of course I'm the one who first mentioned it (even though I called it a "separate issue"), but it's still strange to add bitwise logical operations to the original subject of this thread...

Could it be that C++11 goofed yet a third time with atomics alone? From what I observe in N3337, and it's not entirely unambiguous so I may be misinterpreting things, atomic<bool> is not supposed to have logical operations! That makes no sense at all!

In the new version I'm working on, the intention is to give bool logical operations (still wondering whether to disallow non-bool arguments to avoid conversion surprises), pointer arithmetic operations as before, and other integral types arithmetic operations as before plus bitwise logical operations. Maybe there should also be a pluggable binary operation? The current plan is to use macros exported from include/tbb/machine/*.h to potentially redefine specific operations/operators in an intermediate struct/class, which should allow x86 to improve over the expensive variants of nonfetch_and(), etc.

So far the following works (without any overrides yet, but I'm using macros to declare and/or/xor all at the same time based on standard functors, and that should also make it easy to fork the inheritance structure):

    // TODO: tests for bitwise logical operations
    {
        tbb::atomic<int> a;
        a.store((1<<12)+(1<<8)+(1<<4)+(1<<0));

        int result = a.fetch_and_and((15<<12)+(15<<8)+(15<<4));
        ASSERT(result == (1<<12)+(1<<8)+(1<<4)+(1<<0), NULL);
        result = a.load();
        ASSERT(result == (1<<12)+(1<<8)+(1<<4), NULL);

        a.nonfetch_and((15<<12)+(15<<8)+(15<<0));
        result = a.load();
        ASSERT(result == (1<<12)+(1<<8), NULL);

        a &= (15<<12)+(15<<4)+(15<<0); // TODO: return current value or *this?
        result = a.load();
        ASSERT(result == (1<<12), NULL);
    }

(Added) Actually, while one should not shy away from macros if they can facilitate code generation, they should probably still be avoided as a substitute for anything that can be done directly in the language. For the selective override, I'm already thinking in terms of templates, instead.

(2014-10-11 Added) I'm not actively working on this at the moment, but just a thought: even though x86 implies full-fence memory semantics for read-modify-write operations, it still seems that there is room for more relaxed semantics. Or rather, fully relaxed does not require any compiler fence, and data can be kept in registers instead of being spilled and reloaded. The question now is how much difference this actually makes for x86-64, if not for x86-32 (with 16 resp. 8 general-purpose registers). Here is also where compilers have an edge, because they may be able to keep things in registers across non-relaxed atomic operations, although I'm curious how much is being done beyond or even at the level of consume (which at least can be tracked locally). TBB atomics are at an obvious theoretical disadvantage here, but I have no idea how this translates into real-world performance differences. So the takeaway point is to have at least relaxed and "other" memory semantics. Right?

jimdempseyatthecove
Honored Contributor III

RE: (2014-10-11 Added) ...fully relaxed does not require any compiler fence

This would be permissible (possibly desirable) with the following restrictions:

a) The memory read/write/RMW sequence of the operation with respect to other atomic operations is maintained (IOW same as for volatile)

b) Any writes to said location cannot contain a composition including a read from that location unless it is an atomic RMW

Example:

tbb::atomic<int> Count; Count = 0; ...

++Count; // increment without fetch, allowed to be moved

Note, since the result is not used, the atomic increment could be moved to later in the code.

localCount = ++Count; // not permitted to separate the load from the ++

localCount = Count.fetch_and_increment(); // permitted, but not movable

What you also might consider, something we cannot do, but the compiler writer can do is to implement a concept to permit reduction but not as an omp reduction

// *** fictitious language extension
atomic<int> Count; Count = 0; ...
...
#pragma omp parallel... // no reduction here
... code in parallel
// interior loop
for(int i = 0; i < N; ++i) :: reduce(++:Count) {
  if (QualifierFunction(Array))
    ++Count; // ++ accumulated locally
} // Count is atomically += local count

Jim Dempsey

 

RafSchietekat
Valued Contributor III

I did not imply breaking the (spatial or) temporal atomicity.

RafSchietekat
Valued Contributor III

Here's an update, just for the original subject (no bitwise operations/operators yet).

Comments welcome!
