I still haven't found an accessible discussion of how those operations actually work. For example, if one thread does a release-write, why would that be more costly than just a compiler fence when the read-acquire happens to land on the same core, even if that wasn't known in advance? Well, that's just out of curiosity at this point...
The best description to date is "Asymmetric Dekker Synchronization" by David Dice et al.
It's not about elimination of release/acquire fences, it's about eliminating the #StoreLoad-style fence (MFENCE on x86). Release/acquire fences can be eliminated too, though; that's done in the Linux kernel's RCU. Check out:
http://lwn.net/Articles/253651/
There you can see how Paul McKenney uses asymmetric synchronization to eliminate even the release/acquire fences from the reader side; the compiler fences stay in place.
The technique basically lets you "strip" the hardware part from some fences, leaving only the compiler part, and then compensate for the missing hardware part by something else.