Intel® oneAPI Threading Building Blocks

store_with_release() and load_with_acquire() on VC8

e4lam
Beginner
945 Views
On VC8, I see that __TBB_store_with_release() and __TBB_load_with_acquire() are both implemented with _ReadWriteBarrier(). Having just learned about memory barriers and such, I have a question about this. Can __TBB_store_with_release() use a _WriteBarrier() barrier instead, and similarly _ReadBarrier() for __TBB_load_with_acquire()?

Thanks!
1 Solution
Dmitry_Vyukov
Valued Contributor I
934 Views
Quoting - e4lam
On VC8, I see that __TBB_store_with_release() and __TBB_load_with_acquire() are both implemented with _ReadWriteBarrier(). Having just learned about memory barriers and such, I'm have a question about this. Can __TBB_store_with_release() use a _WriteBarrier() barrier instead and similarly _ReadBarrier() for __TBB_load_with_acquire() ?

No, they can't.
A read barrier is more or less orthogonal to an acquire barrier. An acquire barrier prevents all memory accesses (i.e. both reads and writes) from being hoisted above the load, whereas a read barrier only prevents reads on one side of the barrier from intermixing with reads on the other side of the barrier. The same goes for a write barrier.

However, IMHO, fine-grained precise compiler fences are mostly useless, because they affect only the compiler and so have basically zero run-time cost. So IMHO it's OK to put the strongest full compiler fence everywhere.
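
A minimal sketch of what acquire/release ordering buys you, assuming a C++11 compiler with std::atomic rather than the VC8 intrinsics discussed here. The names Mailbox, publish, consume and run_mailbox_demo are made up for illustration and are not TBB's implementation. The point is that the release store keeps the earlier plain write before it, and the acquire load keeps the later plain read after it, which is more than a reads-only or writes-only barrier promises.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
struct Mailbox {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // flag that publishes the payload
};

// Release store: reads and writes before it may not be moved after it.
void publish(Mailbox& box, int value) {
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Acquire load: reads and writes after it may not be moved before it.
int consume(Mailbox& box) {
    while (!box.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return box.payload;
}

// Runs one producer/consumer exchange and returns what the consumer saw.
int run_mailbox_demo(int value) {
    Mailbox box;
    int received = -1;
    std::thread consumer([&] { received = consume(box); });
    publish(box, value);
    consumer.join();
    assert(received == value);
    return received;
}
```

Calling run_mailbox_demo(42) should hand 42 to the consumer thread; weakening both orderings to memory_order_relaxed would remove that guarantee.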

RafSchietekat
Valued Contributor III
787 Views
"No, they can't."
I would say that the answer is yes, but maybe you know something that I don't (or that I have forgotten again)?

"Read barrier is a kind of orthogonal to acquire barrier. While acquire barrier prevents all memory accesses (i.e. both reads and writes) to hoist above the load, read barrier prevents reads on one side of the barrier to intermix with reads on the other side of the barrier. The same for write barrier."
Can you quote the specification for these functions (maybe _ReadBarrier(), _WriteBarrier() and _ReadWriteBarrier() are all just compiler fences?), and clarify what you mean exactly with "hoist" and "intermix" (maybe "hoist" for C++ vs. execution and "intermix" for C++ vs. machine code?)?

"However, IMHO, fine-grained precise compiler fences are mostly useless, because they affect only compiler, so have basically zero run-time cost. So IMHO it's Ok to put the strongest full compiler fence everywhere."
Even if they only affect the compiler without causing any specific instruction to be emitted (on a specific architecture, notably x86!), their cost and/or effect may not be zero, because they could, at least conceivably, be preventing an optimisation reordering that would otherwise corrupt the program, so I wouldn't call them "useless" (that may be clear to you, but you have to keep your audience in mind when you write such things). By the same logic, perhaps a weaker compiler fence might allow a "partial optimisation" to still occur (subject to testing), so indiscriminately putting the strongest compiler fence everywhere might not be appropriate, even if it would be a conservative approximation (conserving correctness, I mean).
Dmitry_Vyukov
Valued Contributor I
787 Views
Quoting - Raf Schietekat
"No, they can't."
I would say that the answer is yes, but maybe you know something that I don't (or that I have forgotten again)?

"Read barrier is a kind of orthogonal to acquire barrier. While acquire barrier prevents all memory accesses (i.e. both reads and writes) to hoist above the load, read barrier prevents reads on one side of the barrier to intermix with reads on the other side of the barrier. The same for write barrier."
Can you quote the specification for these functions (maybe _ReadBarrier(), _WriteBarrier() and _ReadWriteBarrier() are all just compiler fences?), and clarify what you mean exactly with "hoist" and "intermix" (maybe "hoist" for C++ vs. execution and "intermix" for C++ vs. machine code?)?


Of course:
http://www.google.com/search?q="_readbarrier"+"_writebarrier"

Since that search leads to the official documentation, please disregard my "hoist" and "intermix" wording for now.


Quoting - Raf Schietekat
"However, IMHO, fine-grained precise compiler fences are mostly useless, because they affect only compiler, so have basically zero run-time cost. So IMHO it's Ok to put the strongest full compiler fence everywhere."
Even if they only affect the compiler without causing any specific instruction to be emitted (on a specific architecture, notably x86!), their cost and/or effect may not be zero, because they could, at least conceivably, be preventing an optimisation reordering that would otherwise corrupt the program, so I wouldn't call them "useless" (that may be clear to you, but you have to keep your audience in mind when you write such things). By the same logic, perhaps a weaker compiler fence might allow a "partial optimisation" to still occur (subject to testing), so indiscriminately putting the strongest compiler fence everywhere might not be appropriate, even if it would be a conservative approximation (conserving correctness, I mean).


I am quite skeptical regarding their practical usefulness. I would be interested to see some (at least synthetic) showcase for fine-grained compiler fences where a finer-grained fence makes a significant difference over a coarser-grained one. Can you construct one?

RafSchietekat
Valued Contributor III
787 Views
The specification from Microsoft is quite unsatisfactory (so is it a compiler fence, or isn't it? and will _ReadWriteBarrier() keep a write before a read?), but the mentioning of specific hardware architectures at least seems to imply that on specific architectures any necessary machine instructions will be issued.

I have no ambition to demonstrate any real difference, let alone a significant one, but how are you going to prove a negative...
Dmitry_Vyukov
Valued Contributor I
787 Views
Quoting - Raf Schietekat
The specification from Microsoft is quite unsatisfactory (so is it a compiler fence, or isn't it? and will _ReadWriteBarrier() keep a write before a read?), but the mentioning of specific hardware architectures at least seems to imply that on specific architectures any necessary machine instructions will be issued.

I have no ambition to demonstrate any real difference, let alone a significant one, but how are you going to prove a negative...

Yes, the documentation is unsatisfactory.
_ReadWriteBarrier() will keep a write before a read.
_Read/_Write/_ReadWriteBarrier() are compiler-only fences (see http://msdn.microsoft.com/en-us/library/ms684208%28VS.85%29.aspx).

I can't prove the opposite. Proving a negative is usually more problematic, because I must test ALL cases, whereas you need to find just one...

RafSchietekat
Valued Contributor III
787 Views
"_ReadWriteBarrier() will keep a write before a read."
How could that possibly be useful without a hardware fence?

"_Read/_Write/_ReadWriteBarrier() are compiler only fences (see http://msdn.microsoft.com/en-us/library/ms684208%28VS.85%29.aspx)."
Ah, look: "The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics prevent compiler re-ordering only." Obviously in the documentation about these functions/intrinsics themselves you don't have such a statement... assuming this one is correct, of course. So, here we have the heaviest fence of all, with just the generic name MemoryBarrier() for your confusion, to be avoided if at all possible, but the documentation doesn't tell you that, and there's no reference in sight to a cheaper alternative for use where needed... Not very nice at all. So how should one implement __TBB_store_with_release() and __TBB_load_with_acquire() so that it doesn't break down on other architectures than x86/x64?

"I can't prove the opposite. Proving negative things are usually more problematic because I must test ALL cases, and you must find just one..."
If you think there's no cost anyway, then that's all the more reason to be conservative instead of avoiding the use of those functions/intrinsics.
Dmitry_Vyukov
Valued Contributor I
787 Views
Quoting - Raf Schietekat
"_ReadWriteBarrier() will keep a write before a read."
How could that possibly be useful without a hardware fence?

I am aware of at least 3 practical use cases:
1. Interaction between a thread and a UNIX signal handler.
2. Interaction between threads running on the same processor.
3. Interaction between arbitrary threads when hardware fences are provided by other means.
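
A rough sketch of use case 1, assuming C++11, where std::atomic_signal_fence plays the role of a compiler-only fence (it is not the MSVC intrinsic from this thread). The names g_payload, g_ready, g_seen, on_signal and run_signal_demo are hypothetical, and std::atomic<int> is assumed to be lock-free on the target. The handler runs on the same thread that raised the signal, so only compiler reordering has to be prevented.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Hypothetical names, for illustration only.
int g_payload = 0;                    // plain data shared with the handler
std::atomic<int> g_ready{0};          // assumed lock-free on this platform
std::atomic<int> g_seen{-1};          // what the handler observed

extern "C" void on_signal(int) {
    if (g_ready.load(std::memory_order_relaxed)) {
        // Compiler-only fence: constrains the compiler, emits no hardware fence.
        std::atomic_signal_fence(std::memory_order_acquire);
        g_seen.store(g_payload, std::memory_order_relaxed);
    }
}

int run_signal_demo(int value) {
    std::signal(SIGINT, on_signal);
    g_payload = value;
    // Keep the payload write ahead of the flag write as far as the compiler goes.
    std::atomic_signal_fence(std::memory_order_release);
    g_ready.store(1, std::memory_order_relaxed);
    std::raise(SIGINT);               // handler runs on this same thread
    std::signal(SIGINT, SIG_DFL);     // restore the default disposition
    int result = g_seen.load(std::memory_order_relaxed);
    assert(result == value);
    return result;
}
```

The same shape would apply to use case 2 only if the threads are truly pinned to one processor; across processors a real hardware fence is needed.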

Dmitry_Vyukov
Valued Contributor I
787 Views
Quoting - Raf Schietekat
"_Read/_Write/_ReadWriteBarrier() are compiler only fences (see http://msdn.microsoft.com/en-us/library/ms684208%28VS.85%29.aspx)."
Ah, look: "The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics prevent compiler re-ordering only." Obviously in the documentation about these functions/intrinsics themselves you don't have such a statement... assuming this one is correct, of course. So, here we have the heaviest fence of all, with just the generic name MemoryBarrier() for your confusion, to be avoided if at all possible, but the documentation doesn't tell you that, and there's no reference in sight to a cheaper alternative for use where needed... Not very nice at all. So how should one implement __TBB_store_with_release() and __TBB_load_with_acquire() so that it doesn't break down on other architectures than x86/x64?

Just mark the variable as volatile. That's all.

RafSchietekat
Valued Contributor III
787 Views
Quoting - Dmitriy Vyukov
I am aware of at least 3 practical use cases:
1. Interaction between a thread and a UNIX signal handler.
2. Interaction between threads running on the same processor.
3. Interaction between arbitrary threads when hardware fences are provided by other means.
Really?
1. Maybe, but I don't know what the issues are here.
2. Can probably be disregarded because obsolete.
3. You wouldn't be able to meaningfully combine them with _ReadWriteBarrier(), is what I'm saying.
RafSchietekat
Valued Contributor III
787 Views
Quoting - Dmitriy Vyukov
Just mark the variable as volatile. That's all.
I'll pretend I didn't see that.

(Added) Literally: don't do that unless it's well encapsulated and won't infect the rest of the program with Microsoft-onliness.

(Added) And why would the compiler add machine instructions without applying the accompanying compiler fence? That makes no sense at all.
Dmitry_Vyukov
Valued Contributor I
787 Views
Quoting - Raf Schietekat
Really?
1. Maybe, but I don't know what the issues are here.
2. Can probably be disregarded because obsolete.
3. You wouldn't be able to meaningfully combine them with _ReadWriteBarrier(), is what I'm saying.

Well, what can I say... I am a bit confused... I can go into deep detail on each point... however, Raf, aren't you trolling here?

RafSchietekat
Valued Contributor III
787 Views
Quoting - Dmitriy Vyukov
Well, what can I say... I am a bit confused... I can go in deep details regarding each point... however, Raf, don't you trolling on this?
That's an unfair assumption, but of course you're not obliged to continue this.
Dmitry_Vyukov
Valued Contributor I
787 Views
Quoting - Raf Schietekat
That's an unfair assumption, but of course you're not obliged to continue this.

In short:
1. You need only compiler fences here; basically you need to 'strip' the hardware part from a fence. Since the thread part and the signal part are executed on a single OS/hardware thread, there is no issue of hardware ordering.
2. It's not obsolete. You can bind two or more threads to a single processor, which is fairly reasonable for low-level parallelism support libraries like TBB. Then you need only the compiler part of the fences too.
3. I, and not only I, am indeed able to combine them in a meaningful way. Check out Joe Seigh's SMR+RCU:
http://lkml.indiana.edu/hypermail/linux/kernel/0505.1/0252.html
or David Dice et al.'s Asymmetric Dekker Synchronization:
http://home.comcast.net/~pjbishop/Dave/Asymmetric-Dekker-Synchronization.txt
or my Asymmetric Reader-Writer Mutex:
http://groups.google.com/group/lock-free/browse_frm/thread/1efdc652571c6137




Dmitry_Vyukov
Valued Contributor I
787 Views
Quoting - Raf Schietekat
I'll pretend I didn't see that.

(Added) Literally: don't do that unless it's well encapsulated and won't infect the rest of the program with Microsoft-onliness.

(Added) And why would the compiler add machine instructions without applying the accompanying compiler fence? That makes no sense at all.

Since you are considering MS volatiles as a replacement for MS _ReadWriteBarrier(), MS-onliness is not an issue at all. For now (until C++0x) you will have to fall back to the platform-specific level on every platform anyway, so I do not see how you can do better than that.

MS volatiles provide both compiler and hardware ordering. Hardware-only fences do not make any sense. The MS guys understand this.

RafSchietekat
Valued Contributor III
787 Views
See, how else would I have obtained those specific links without trawling the whole Internet? :-) Thanks, I'll do some reading tonight, and maybe tomorrow some more trolling.

e4lam
Beginner
787 Views
Thanks for the replies!
RafSchietekat
Valued Contributor III
787 Views
Dmitriy, sorry for the delayed response.

#13 No, I don't see it. Or maybe it's a misunderstanding. I'm not aware of any bidirectional machine-level memory fences, so why would there be compiler-level ones? Isn't the real meat in the atomic operation, sided by necessarily asymmetric fences, on one side or both? That would go for 1 and 2. I couldn't find any mention of "compiler fence" in the first two references for 3, and in your own example the uses of _ReadWriteBarrier() are even commented as either acquire or release, so why not use _ReadBarrier() and _WriteBarrier() instead?

#14 As for my reaction to the MS-specific treatment of "volatile", that's just because it's so much easier to infect code by changing the meaning of an existing keyword than by the use of a new construct that would cause a compilation error elsewhere.

I still haven't found an accessible discussion about how those operations actually work. For example, if one thread does a release-write, why would that be more costly than just a compiler fence if the read-acquire happens to be on the same core even if that wasn't known before it so happened? Well, that's just out of curiosity at this point...
Dmitry_Vyukov
Valued Contributor I
787 Views
Quoting - Raf Schietekat
Dmitriy, sorry for the delayed response.

#13 No, I don't see it. Or maybe it's a misunderstanding. I'm not aware of any bidirectional machine-level memory fences, so why would there be compiler-level ones?

As for hardware bidirectional fences, check out membar #LoadLoad, membar #StoreStore on SPARC RMO, and SFENCE, LFENCE on x86.
I believe they are actually just as useful and just as widespread as unidirectional fences.

Dmitry_Vyukov
Valued Contributor I
787 Views
Quoting - Raf Schietekat
Isn't the real meat in the atomic operation, sided by necessarily asymmetric fences, on one side or both? That would go for 1 and 2. I couldn't find any mention of "compiler fence" in the first two references for 3, and in your own example the uses of _ReadWriteBarrier() are even commented as either acquire or release, so why not use _ReadBarrier() and _WriteBarrier() instead?


The compiler barrier must be in the same place where you would normally put a #StoreLoad fence. In my asymmetric mutex you can find that place by the "no explicit #StoreLoad" comment.

The barriers that are commented as acquire and release are different barriers; they are not relevant to this discussion.
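
For readers unsure where "the #StoreLoad place" is, here is a small symmetric sketch, assuming C++11 atomics. It is the plain Dekker-style baseline, not the asymmetric algorithms linked above, and the names StoreLoadResult and run_store_load_once are hypothetical. Each thread stores to its own flag, then needs a store-load fence before reading the other flag. As I understand the asymmetric schemes, they keep only a compiler fence at that spot on the fast side and obtain the hardware ordering by other means on the slow side.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
struct StoreLoadResult {
    int r1;  // what thread A read from y
    int r2;  // what thread B read from x
};

StoreLoadResult run_store_load_once() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;
    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        // The #StoreLoad position: a full hardware fence in this baseline.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();
    // With both fences in place, at least one thread must see the other's store.
    assert(!(r1 == 0 && r2 == 0));
    return StoreLoadResult{r1, r2};
}
```

Dropping either fence allows both threads to read 0, which is exactly the outcome Dekker-style mutual exclusion has to rule out.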

Dmitry_Vyukov
Valued Contributor I
685 Views
Quoting - Raf Schietekat
#14 As for my reaction to the MS-specific treatment of "volatile", that's just because it's so much easier to infect code by changing the meaning of an existing keyword than by the use of a new construct that would cause a compilation error elsewhere.


Agreed.
It may make porting MSVC code to other platforms quite problematic.
The better way would be to finally implement something along the lines of std::atomic<>.
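
A sketch of what that could look like once std::atomic<> is available (C++11 and later). The helper names below only echo the ones from this thread; they are hypothetical and not TBB's actual code. The memory-order argument lets the compiler choose whatever compiler and hardware fences each target needs, so no vendor-specific meaning of volatile is involved.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helpers; the names only echo the ones discussed in this thread.
template <typename T>
void store_with_release(std::atomic<T>& location, T value) {
    location.store(value, std::memory_order_release);
}

template <typename T>
T load_with_acquire(const std::atomic<T>& location) {
    return location.load(std::memory_order_acquire);
}

// Single-threaded round trip, only to show the call shape.
int roundtrip_demo() {
    std::atomic<int> flag{0};
    store_with_release(flag, 7);
    int value = load_with_acquire(flag);
    assert(value == 7);
    return value;
}
```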