Arch is referring to ISO C++'s volatile, which has nothing to do with multi-threading, memory fences, or lock-free programming.
And you are probably referring to Microsoft Visual C++'s volatile, which is essentially promoted to the rank of a multi-threading synchronization primitive.
I think that this is about flushing out the writes somehow, so that the other thread gets to see the new Ready value, which makes at least some sense to me. But I'm also in favour of using real atomics, and you can write those without a single "volatile" (I did, anyway), because they would be redundant with the inline-assembler store instructions. Still, what prevents the optimiser from reordering even those with later code, perhaps code that waits for a value that cannot appear before the store is seen by another thread, leading to deadlock? I checked my code again, and I thought that I had at least a compiler fence at both ends to prevent just that, but not so, apparently, and I now consider that an oversight. I'm aware that language-level atomics can do other kinds of things, if only the specification were readable...
(Added) "I did, anyway": or not yet... they're still there in tbb_machine.h.
volatile, though, does not assure the variable is aligned on a natural boundary
(variables on a natural boundary can be atomically R, W, or LOCK RMW).
volatile does assure that the compiler will always generate code to reference memory.
Therefore atomic typed variables, when used internally with volatile, are assured to atomically R, W, or LOCK RMW.
Variables typed volatile alone are not assured to be aligned on a natural boundary, and therefore are NOT assured to atomically R, W, or LOCK RMW...
...UNLESS the programmer takes caution to enforce alignment rules on such declared variables.
Some of these optimizations may interfere with multi-threaded programming.
int line = 0;
line = 1;
line = 2;
The compiler is permitted to remove the first statement. Therefore, if another thread is monitoring progress, it will never see line = 1.
Be cautious in assuming atomicity:
volatile int line = 0;
line = 1;
line = 2;
The compiler is NOT permitted to remove the first statement;
however, it is not required to align line on a natural boundary.
A stack-local int variable will tend to be naturally aligned.
A structure member variable is not assured to be naturally aligned.
As a programmer, you are required to assure alignment (when atomicity is required).
The RMW operations require specific assembler instructions, but the existing TBB implementation does not bother with those for ordinary loads and stores on x86 (except for 8-byte data) or x64, that's true. I don't agree with that, though, even if it happens to work. I think that some existing code would break if volatile were taken out, so that decision is not so easy to make.
>>The RMW operations require specific assembler instructions
Correct - these (RMW) are provided with compiler intrinsics (or inline assembler).
So atomic will "hide" the nasties
However, use of atomic will at times require an explicit member-function call
(or whatever the member function ends up being called,
or, in lieu of a member function, some new keyword/directive as in C++0x),
where volatile can use a plain assignment,
provided you are also careful to correctly align the volatile variable.
atomic variables are good and thread-safe,
BUT the behavior may not necessarily produce the desired result.
The use cautions are not clearly documented, at least not to the point of presenting a clear picture up front.
The same issue is involved with volatile with respect to alignment/atomicity.
In both cases the usage examples should include a "***WARNING***": when programming xxx, use yyy technique.
I didn't spot that yet?
"BUT the behavior may not necessarily produce the desired result."
Specifically? The compiler's coalescing optimisation you mentioned (let's discount statistics), or something else (too)?
"The same issue is involved with volatile with respect to alignment/atomicity"
Such use seems dubious (not portable), even ignoring memory semantics.
Specifically, the coalescing optimization, which may result in coalescing across a lengthy loop, and which in that case can introduce unintended interlocks (assuming you forgot to use the designated member function for atomic, including the correct std::memory_order_....).
The atomic appears to be written from the perspective of, and with a bias towards, the thread issuing the statements, as opposed to the threads observing the results. Whereas volatile appears impartial.
Coalescing optimizations are not always good
Assume you are on a processor with HT capability
Assume the hardware PREFETCHn instruction is either not implemented or not doing what you want it to do.
You can split the thread processing into one thread that does the work, and a second that monitors the progress of a volatile variable (or variables) (non-coalescing). This second thread can be "brainless", so to speak, and simply performs memory moves to a register from the addresses pointed to by each of the volatile variables, should they change (and include _mm_pause).
Essentially the second HT thread does no work except for assuring that soon-to-be-accessed data is fresh and ready in L1 cache. This will work across page boundaries as well as through page faults.
In the above scenario coalescing optimizations will thwart the intentions of the programmer.
The above is also a simple example of synchronization through a memory mailbox; coalescing optimizations interfere with such synchronization. The two (or more) threads cannot keep in lock-step if some of the steps cannot be observed.
Unless you come up against a fiercely optimising compiler that won't let go even after linking, a portable compiler fence would require nothing more than a call to an external empty function. In g++, you could do without the hassle of another source file, however trivial, and the overhead of making that call, however small, by using a blank inline assembler call that "clobbers" all memory. At least, these are commonly understood to provide the functionality you seek, at least between operations that work on data that is not just locally visible, unless the g++ option even frees you from that restriction.