e4lam
Beginner

store_with_release() and load_with_acquire() on VC8

On VC8, I see that __TBB_store_with_release() and __TBB_load_with_acquire() are both implemented with _ReadWriteBarrier(). Having just learned about memory barriers and such, I have a question about this: could __TBB_store_with_release() use a _WriteBarrier() instead, and similarly _ReadBarrier() for __TBB_load_with_acquire()?
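Just to make the question concrete, here is roughly the shape I have in mind (my own sketch, not the actual TBB source; the my_* names are made up):

#include <intrin.h>

template <typename T>
inline void my_store_with_release(volatile T& location, T value) {
    _ReadWriteBarrier();  // what I see today: full compiler fence
    // _WriteBarrier();   // my question: would this weaker fence suffice?
    location = value;
}

template <typename T>
inline T my_load_with_acquire(const volatile T& location) {
    T value = location;
    _ReadWriteBarrier();  // what I see today: full compiler fence
    // _ReadBarrier();    // my question: would this weaker fence suffice?
    return value;
}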

Thanks!
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#14 As for my reaction to the MS-specific treatment of "volatile", that's just because it's so much easier to infect code by changing the meaning of an existing keyword than by the use of a new construct that would cause a compilation error elsewhere.


Agreed.
It may make porting MSVC code to other platforms quite problematic.
The better way would be to finally implement something along the lines of std::atomic<>.
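For example, the portable shape would be something like this (a minimal sketch in C++0x std::atomic<> syntax; the names are illustrative):

#include <atomic>

std::atomic<int> flag(0);
int data = 0;

void producer() {
    data = 42;
    flag.store(1, std::memory_order_release); // release store, explicit and portable
}

void consumer() {
    if (flag.load(std::memory_order_acquire)) // acquire load, explicit and portable
        (void)data; // guaranteed to observe data == 42
}

Here the ordering semantics are attached to the individual operation rather than smuggled in through a changed meaning of the volatile keyword, so porting to another compiler cannot silently drop them.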
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
I still haven't found an accessible discussion of how those operations actually work. For example, if one thread does a release-write, why would that be more costly than just a compiler fence if the read-acquire happens to occur on the same core, even if that wasn't known in advance? Well, that's just out of curiosity at this point...

The best description to date is "Asymmetric Dekker Synchronization" by David Dice et al.

It's not about eliminating release/acquire fences; it's about eliminating the #StoreLoad-style fence (MFENCE). Release/acquire fences can be eliminated too, though; that's done in Linux kernel RCU. Check out:
http://lwn.net/Articles/253651/
There you can see how Paul McKenney uses asymmetric synchronization to eliminate even the release/acquire fences from the reader side; the compiler fences are still in place.
The technique basically allows you to "strip" the hardware part from some fences, leave only the compiler part, and then compensate for the hardware part by something else.
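The shape of the reader/writer split looks roughly like this (a minimal sketch in C++0x std::atomic<> syntax; the names are made up, and the crucial writer-side compensation - forcing an ordering point on the reader's CPU, which RCU gets via the scheduler/IPIs - is only marked by a comment, not implemented):

#include <atomic>

std::atomic<int> reader_active(0);
std::atomic<int> data(0);

void reader_fast_path() {
    reader_active.store(1, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_seq_cst); // compiler-only fence, emits no instruction
    int d = data.load(std::memory_order_relaxed);
    (void)d;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    reader_active.store(0, std::memory_order_relaxed);
}

void writer_slow_path() {
    data.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full hardware fence (MFENCE-class)
    // ...plus the compensation step: force an ordering point on every
    // reader's CPU (IPI, scheduler round, etc.) before trusting the flag.
    while (reader_active.load(std::memory_order_relaxed)) { /* wait */ }
}

Without that compensation step the sketch is not safe on hardware that reorders the reader's store and load; it only illustrates how the hardware cost migrates from the hot reader path to the cold writer path.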

RafSchietekat
Black Belt
#18 Er, why did I write that? I have no idea... sorry, please ignore.

#19 Except that there is no compiler barrier where the #StoreLoad comment is, and the acquire and release barriers are the same... oh well.

#20 So, if compiler-only fences are such a good idea, why haven't I seen them in TBB or C++0x, and why haven't they come up in a discussion here before? An oversight to be corrected?

#21 Superior performance through an entirely different approach may be (highly) preferable, where feasible, but I was still wondering about the implementation of compiler-only vs. machine-level memory fences. Perhaps the hardware could somehow dynamically detect that everything occurs on the same core and avoid cache-related external chatter?

But maybe I should drop the subject: my atomics proposal seems to be dead and buried, and revving up with a disengaged clutch is said to be bad for the engine...
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#19 Except that there is no compiler barrier where the #StoreLoad comment is, and the acquire and release barriers are the same... oh well.

It's a known issue :)
I am in the process of writing a lengthy, detailed description of asymmetric synchronization, but I do not know how long it will take... probably months... and along the way I may completely lose interest, so it may never appear at all :(

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#20 So, if compiler-only fences are such a good idea, why haven't I seen them in TBB or C++0x, and why haven't they come up in a discussion here before? An oversight to be corrected?

Hmm... in order to correct this I would suggest you re-read the C++0x draft, especially the part related to std::atomic_signal_fence() (which was previously called std::compiler_fence()) :)
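A minimal sketch of what it is for (the classic same-thread signal-handler case; the names are illustrative):

#include <atomic>
#include <csignal>

volatile std::sig_atomic_t ready = 0;
int payload = 0;

void prepare() {
    payload = 42;
    std::atomic_signal_fence(std::memory_order_release); // compiler-only fence, emits no instruction
    ready = 1; // a signal handler running on this thread now sees payload == 42
}

It constrains compiler reordering exactly like std::atomic_thread_fence(), but the implementation is allowed to emit no hardware fence instruction at all.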


Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#21 Superior performance through an entirely different approach may be (highly) preferable, where feasible, but I was still wondering about the implementation of compiler-only vs. machine-level memory fences. Perhaps the hardware could somehow dynamically detect that everything occurs on the same core and avoid cache-related external chatter?

Hardware indeed avoids cache-coherence traffic for data accessed from a single core.
However, there are still overheads related to instruction ordering: pipeline flushes and store-buffer drains.
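You can observe this even in a single-threaded program (a rough micro-benchmark sketch; the numbers are entirely machine-dependent and the loop bound is arbitrary):

#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
    volatile int x = 0;
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) {
        x = i;
        std::atomic_thread_fence(std::memory_order_seq_cst); // MFENCE-class fence each iteration
    }
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%lld ms\n", ms);
    return 0;
}

No other core ever touches x, so there is no coherence traffic, yet each fence still drains the store buffer and stalls the pipeline; removing the fence makes the loop dramatically faster.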