Intel® oneAPI Threading Building Blocks

store_with_release() and load_with_acquire() on VC8

e4lam
Beginner
On VC8, I see that __TBB_store_with_release() and __TBB_load_with_acquire() are both implemented with _ReadWriteBarrier(). Having just learned about memory barriers and such, I have a question about this. Can __TBB_store_with_release() use a _WriteBarrier() instead, and similarly _ReadBarrier() for __TBB_load_with_acquire()?

Thanks!
1 Solution
Dmitry_Vyukov
Valued Contributor I
Quoting - e4lam
On VC8, I see that __TBB_store_with_release() and __TBB_load_with_acquire() are both implemented with _ReadWriteBarrier(). Having just learned about memory barriers and such, I have a question about this. Can __TBB_store_with_release() use a _WriteBarrier() instead, and similarly _ReadBarrier() for __TBB_load_with_acquire()?

No, they can't.
A read barrier is somewhat orthogonal to an acquire barrier. An acquire barrier prevents all memory accesses (i.e. both reads and writes) from being hoisted above the load, whereas a read barrier only prevents reads on one side of the barrier from intermixing with reads on the other side. The same goes for the write barrier.

However, IMHO, fine-grained, precise compiler fences are mostly pointless: they affect only the compiler, so they have basically zero run-time cost anyway. So IMHO it's OK to put the strongest full compiler fence everywhere.
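For reference, roughly the pattern being discussed, as a minimal sketch (simplified, not the exact TBB source, which may differ in details): on x86 the hardware already gives acquire ordering to plain loads and release ordering to plain stores, so only the compiler needs to be restrained, and one full compiler fence serves on both sides.

#include <intrin.h>

// Sketch of the VC8/x86 pattern; the real TBB headers differ in details.
template <typename T>
T load_with_acquire(const volatile T& location) {
    T value = location;    // plain load: x86 already gives it acquire ordering
    _ReadWriteBarrier();   // compiler-only fence: later accesses must not hoist above the load
    return value;
}

template <typename T>
void store_with_release(volatile T& location, T value) {
    _ReadWriteBarrier();   // compiler-only fence: earlier accesses must not sink below the store
    location = value;      // plain store: x86 already gives it release ordering
}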


25 Replies
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
I still haven't found an accessible discussion of how those operations actually work. For example, if one thread does a release-write, why would that be more costly than just a compiler fence when the read-acquire happens to land on the same core, even if that wasn't known in advance? Well, that's just out of curiosity at this point...

The best description to date is "Asymmetric Dekker Synchronization" by David Dice et al.

It's not about eliminating release/acquire fences; it's about eliminating the #StoreLoad-style fence (MFENCE). Release/acquire fences can be eliminated too, though; that's done in the Linux kernel's RCU. Check out:
http://lwn.net/Articles/253651/
There you can see how Paul McKenney uses asymmetric synchronization to eliminate even the release/acquire fences from the reader side, while the compiler fences stay in place.
The technique basically lets you "strip" the hardware part from some fences, leaving only the compiler part, and then compensate for the missing hardware ordering by other means.
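To make the asymmetry concrete, here is a rough user-mode sketch of the idea (hypothetical names, C++0x-style atomics; not the code from the Dice et al. paper or the kernel): the frequent "fast" side drops its #StoreLoad hardware fence down to a compiler-only fence, and the rare "slow" side compensates by calling the Win32 API FlushProcessWriteBuffers(), which forces a barrier in every thread of the process before the slow side reads the peer's flag. Both sides are try-lock style, so either may back off; mutual exclusion is still preserved.

#include <windows.h>
#include <atomic>

std::atomic<int> fast_flag(0);  // announced by the fast (frequent) side
std::atomic<int> slow_flag(0);  // announced by the slow (rare) side

// Fast side: Dekker-style "announce, then check the peer", but with the
// #StoreLoad hardware fence replaced by a compiler-only fence.
bool fast_side_try_enter() {
    fast_flag.store(1, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_seq_cst);   // no MFENCE emitted here
    if (slow_flag.load(std::memory_order_relaxed) != 0) {
        fast_flag.store(0, std::memory_order_relaxed);
        return false;                                       // slow side active: back off
    }
    return true;
}

// Slow side: pays for the missing hardware fence on behalf of the fast side.
bool slow_side_try_enter() {
    slow_flag.store(1, std::memory_order_seq_cst);
    FlushProcessWriteBuffers();                             // forces a barrier in all threads,
                                                            // restoring the fast side's ordering
    if (fast_flag.load(std::memory_order_seq_cst) != 0) {
        slow_flag.store(0, std::memory_order_seq_cst);
        return false;
    }
    return true;
}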

RafSchietekat
Valued Contributor III
#18 Er, why did I write that? I have no idea... sorry, please ignore.

#19 Except that there is no compiler barrier where the #StoreLoad comment is, and the acquire and release barriers are the same... oh well.

#20 So, if compiler-only fences are such a good idea, why haven't I seen them in TBB or C++0x, and why haven't they come up in a discussion here before? An oversight to be corrected?

#21 Superior performance through an entirely different approach may be (highly) preferable, where feasible, but I was still wondering about the implementation of compiler-only vs. machine-level memory fences etc. Perhaps the hardware could somehow dynamically detect that everything occurs on the same core, and avoid cache-related external chatter?

But maybe I should drop the subject: my atomics proposal seems to be dead and buried, and revving up with a disengaged clutch is said to be bad for the engine...
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#19 Except that there is no compiler barrier where the #StoreLoad comment is, and the acquire and release barriers are the same... oh well.

It's a known issue :)
I am in the process of writing a lengthy, detailed description of asymmetric synchronization, but I do not know how long it will take... probably months... and along the way I may completely lose interest, so it may actually not appear at all :(

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#20 So, if compiler-only fences are such a good idea, why haven't I seen them in TBB or C++0x, and why haven't they come up in a discussion here before? An oversight to be corrected?

Hmm... to correct this, I would suggest re-reading the C++0x draft, especially the part related to std::atomic_signal_fence() (which was previously called std::compiler_fence()) :)
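A tiny illustration of the distinction (the emitted instructions depend on the compiler and target; this describes typical x86 codegen):

#include <atomic>

void compiler_fence_only() {
    // Constrains only compiler reordering; on x86 no instruction is emitted.
    std::atomic_signal_fence(std::memory_order_seq_cst);
}

void full_hardware_fence() {
    // Constrains the hardware as well; on x86 this typically compiles to MFENCE.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}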


Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#21 Superior performance through an entirely different approach may be (highly) preferable, where feasible, but I was still wondering about the implementation of compiler-only vs. machine-level memory fences etc. Perhaps the hardware could somehow dynamically detect that everything occurs on the same core, and avoid cache-related external chatter?

The hardware does indeed avoid cache-coherence traffic for data accessed from only one core.
However, there are still overheads related to instruction ordering: the pipeline flush and the store-buffer drain.
