Intel® oneAPI Threading Building Blocks

Memory Semantics and C++0x

RafSchietekat
Valued Contributor III
"Formal definitions of the memory model were rejected as unreadable by the vast majority of programmers." - Pthreads standard, as quoted in article mentioned here

I would like to review some issues related to the upcoming revision of the C++ standard, by looking at some of the relevant texts. A good first choice seems to be "Threads Cannot Be Implemented As a Library", a must-read seminal article (or in my case must-reread, because it is an important reference in another article that I'm reading now for the first time). For brevity, I will try to limit my comments to what immediately concerns me now: a potential middle ground between the naivety exposed in this article, and unreadably rigorous formality.

In "4.1 Concurrent modification", a speculative code transformation is outlined to show that the concept of "race" should be defined. However, when used with C++0x's relaxed atomic operations, formally there is no "race" (by its definition in C++0x), but nothing has really changed, and relaxed atomic operations are still considered problematic. In my proposed reimplementation of the TBB atomics library, I have used the qualifier "volatile" specifically to disallow speculative writes, relying on the current specification that describes the behaviour of a program partly in terms of how it treats "volatile" variables (details not repeated here). So, doesn't that solve the problem (and without any language change), or can it still reoccur on the hardware side? If so, what hardware would that be, and, e.g., would such hardware not already be problematic for "volatile" accesses that have nothing to do with multithreading?

In "4.2 Rewriting of Adjacent Data", a clear argument is made to define "memory locations", an unavoidable change.

In "4.3 Register promotion", it is made clear that such promotion has to be restricted, but I see nothing that should worry a programmer: determining which promotions are still safe seems strictly a compiler writer's problem.

In "5.1 Expensive Synchronization: An Example", a plea is made... to allow races!

So, enough to make you worry about using multithreading now, but only requiring simple changes, I would say, and not a hint yet of the threat that the behaviour of programs containing races is undefined (quite to the contrary).
RafSchietekat
Valued Contributor III
#17 "How would you define behavior of the program? Provide formal specification which will completely define behavior."
That's silly: g_obj is a non-atomic that's being concurrently written and read (so thread 2 might theoretically read a partially updated value, even though that's unlikely to happen on current hardware), and it also might be written before the constructor has completed (so virtual_func() might be called before its time, and it being a virtual function makes things infinitely worse). Why would anyone need the proposed formal semantics to see that? The difference is that with the proposed formal semantics, I would give up reasoning, because I don't understand them (who does?).
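For reference, the shape of the code under discussion, as I understand it (a sketch; only the names g_obj and virtual_func come from the example):

struct Base { virtual void virtual_func() = 0; virtual ~Base() {} };
struct Derived : Base { void virtual_func() {} };

Base* g_obj = 0;                // ordinary pointer: no atomic, no lock

void thread1() {
    g_obj = new Derived;        // store may become visible before the
}                               // constructor (and vtable pointer) completes

void thread2() {
    if (g_obj)                  // may read a partially updated pointer...
        g_obj->virtual_func();  // ...or dispatch through a half-built vtable
}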

"Note that the loaded value is completely defined here - we know that it's *right* pointer. The problem is deeper here, it's not only about separate undefined values. Anyway, if program contains some undefined values, does it make sense to reason about it behavior further? Undefined value will be used is some way in the program, so basically uncertainty will propagate through the program like the plague."
That must depend on what the undefined value can cause to happen, but I don't need a specification to predigest my conclusions there, especially not in the form of blanket hysteria.

"Well, yes, specification might provide 100 page description of all possible combinations of threads interaction and involved races and for some of them define behavior, for some of them define several possible behaviors, and for some of them still say just UB (see example above). Hmmm... was not you saying that specification must not be overloaded?"
There's no reason why an alternative specification would be 100 pages long. I still don't know what UB means. Remind me about what I said regarding "overloaded specification" and where?

"And what to do with virtual functions, etc? How would you define behavior which involves virtual functions provided that their implementation is implementation defined?"
Now that's a legitimate example of undefined behaviour of the entire program.

"If you are loading int and then pass it to the printf, then we may say that the output value will be undefined. And what if we are loading char*? You must be insane if you want to cope with this in the specification."
Nobody needs new semantics to know that printing an undefined int is merely useless and printing an undefined pointer as a C string is somewhat more problematic.

"Ok, probably not, show your variant of the spec. What I see todate is that IMHO you want to underspecify significant moments, overspecify insignificant moments and bind abstract specification to some particular implementation..."
I'm just testing the waters: reading documents, and testing whether anyone is responsive to my objections so far. If the answer there is negative, it would be pointless to proceed further, for lack of resources. The current proposal may not underspecify significant moments, but it definitely overspecifies insignificant moments (like printing an undefined integer).The current proposal both underspecifies (what happens when a program is linked with non-C(++) code?) and overspecifies (run for your lives: it's the terrible Undefined Integer!). I don't see why reasoning in terms of conceptual fences or whatever they might be called (that can be conservatively implemented on various architectures, or first guide program transformation) would be unsuitable.

(Correction 07:40 UTC) See text.

(Correction 2009-03-31) Well, both would be underspecifications, actually. So then what am I supposed to be underspecifying?
robert_jay_gould
Beginner
Quoting - Dmitriy Vyukov
I am not sure I get you. So, since the matter is too complicated in itself, you are proposing just to eliminate it from the specification and provide a simple POSIX-like mutex-based model, right?

No, not exactly that. I'm saying that managing threads in "general terms" should be no more complicated than the POSIX-like mutex-based model. Place a big lock around the critical code and don't let anyone else touch it until you're done.

Most times this solution is perfectly acceptable. So I'd say locked code should not be affected at all by new standards; everyone (with a degree of competence and some experience) can understand locked code in its current state, and it's a proven model. Any new feature that breaks this model is evil.

Lockless code/algorithms/containers/etc. is something the "masters" tell us common programmers not to try to write, because it's very tricky, and even the "masters" get it wrong many times, so we've tried not to touch this stuff for many years (but I think there should be a way to democratize this too, although it's probably not realistic).

However, simple atomics as they happen to exist, with their "standard" quirks by empiric accident and/or design, have at this point in time been in our (the common programmer's) field for a while; we commonly rely on them for our tasks, and we have plenty of folklore about how they should behave. I think we don't want people messing with this folklore.

So the current accidental specs of atomics should not be taken away from us; they should just make these empirical atomics the standard for C++0x. And if there is a need for a new tool, make a new tool - call them Bosons or Fermions, for example.


Dmitry_Vyukov
Valued Contributor I

Quoting - robert.jay.gould
"Not, exactly that. I'm saying that managing threads in "general terms" should be no more complicated than the POSIX-like mutex-based model. Place a big lock around the critical code and don't let anyone else touch it until you're done.
Most times this solution is perfectly acceptable. So I'd say locked code should not be affected at all by new standards, everyone (with a degree of competence and some experience) can understand locked code in its current state, and its a proven model. Any new feature that breaks this model is evil."

This model is not broken. Mutex-based synchronization is not affected by the C++0x standard.


"Lockless code/algorithms/containers/etc... is something the "masters" tell us common programmers to not try to write lockless constructs, because its very tricky, and even the "masters" get it wrong many times, so we've tried not to touch this stuff for many years (but I think there should be a way to democratize this too, although its probably not realistic)."

Don't rely on memory barriers for synchronization... That's only if you are not aware of Relacy Race Detector!



"However simple atomics as they happen to exist, with their "standard" quirks by empiric accident and/or design, at this point in time, have been in our (the common programmer's) field for a while, and we commonly rely on them for our tasks, and we have plenty of folklore about how they should behave. I think we don't want people messing with this folklore.
So the current accidental specs of atomics should not be taken away from us, they should just make this empiricalatomic the standard for C++0x.And if there is a need for a new tool make a new tool call them Bosons or Fermions for example."

The standard makes exactly those empirical atomics the standard, i.e.:

std::atomic<int> x;
x.compare_exchange_strong(cmp, xchg);

And exactly as you said, there is a new thing - "atomics with fine-grained memory ordering", i.e.:

std::atomic<int> x;
x.compare_exchange_weak(cmp, xchg, memory_order_acq_rel, memory_order_relaxed);


If you are trying to read the formal language specification and failing to understand it, you should probably start with Hans Boehm's non-formal articles:
http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2480.html
Here is complete list:
http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/

Note that the simple sequentially consistent model and the relaxed model are clearly separated:

Simpler rules for programmers

Based on this definition, as shown in N2338, it becomes a theorem that programs using locks and atomic variables with default ("sequentially consistent") ordering to protect other shared variables from simultaneous access (other than simultaneous read access) behave as though they were executed by simply interleaving the actions of each thread. We expect that this is the rule that will be taught to most programmers.
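To illustrate the rule (my example, not taken from the paper): with default seq_cst atomics, the classic store-buffering outcome is excluded, exactly as naive interleaving predicts.

#include <atomic>

std::atomic<int> x(0), y(0);
int r1, r2;

void thread1() { x.store(1); r1 = y.load(); }  // defaults are seq_cst
void thread2() { y.store(1); r2 = x.load(); }

// In any interleaving at least one store precedes both loads,
// so the outcome r1 == 0 && r2 == 0 is impossible.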

Dmitry_Vyukov
Valued Contributor I

"#17 "How would you define behavior of the program? Provide formal specification which will completely define behavior."
That's silly: g_obj is a non-atomic that's being concurrently written and read (so thread 2 might theoretically read a partially updated value, even though that's unlikely to happen on current hardware), and it also might be written before the constructor has completed (so virtual_func() might be called before its time, and it being a virtual function makes things infinitely worse). Why would anyone need the proposed formal semantics to see that? The difference is that with the proposed formal semantics, I would give up reasoning, because I don't understand them (who does?)."

Everyone who is not doing just cowboy programming needs a formal specification to get an explicit answer as to whether a program works or does not work now, and whether it will keep working on new compilers, hardware, OSes, etc. That's what formal specifications are for.



""Note that the loaded value is completely defined here - we know that it's *right* pointer. The problem is deeper here, it's not only about separate undefined values. Anyway, if program contains some undefined values, does it make sense to reason about it behavior further? Undefined value will be used is some way in the program, so basically uncertainty will propagate through the program like the plague."
That must depend on what the undefined value can cause to happen, but I don't need a specification to predigest my conclusions there, especially not in the form of blanket hysteria."

Do you see any sense or value in a more detailed specification wrt this aspect? Frankly, I do not. Even if I load just an int which I am going to use to index a statically allocated array, I can check that the index does not exceed the bounds, but I cannot check that the index is semantically correct. So what? Yes, the program will not crash, but it will transfer money to my account instead of yours.
However, if you are programming against some particular platform, then you indeed can figure out the semantics of data races - they are typically defined for the platform. But then how does this relate to the international standard?

Btw, what about future platforms on which any data race will cause a hardware interrupt, just as integer overflows and unaligned data accesses do now? Have you thought about them?


UB is Undefined Behavior.



"If you are loading int and then pass it to the printf, then we may say that the output value will be undefined. And what if we are loading char*? You must be insane if you want to cope with this in the specification."
Nobody needs new semantics to know that printing an undefined int is merely useless and printing an undefined pointer as a C string is somewhat more problematic.

What semantics are you talking about??? The standard basically says that accessing an undefined string is UB. Just in more general terms, just as an abstract standard must put it.


"Ok, probably not, show your variant of the spec. What I see todate is that IMHO you want to underspecify significant moments, overspecify insignificant moments and bind abstract specification to some particular implementation..."
I'm just testing the waters: reading documents, and testing whether anyone is responsive to my objections so far. If the answer there is negative, it would be pointless to proceed further, for lack of resources. The current proposal may not underspecify significant moments, but it definitely overspecifies insignificant moments (like printing an undefined integer).The current proposal both underspecifies (what happens when a program is linked with non-C(++) code?) and overspecifies (run for your lives: it's the terrible Undefined Integer!). I don't see why reasoning in terms of conceptual fences or whatever they might be called (that can be conservatively implemented on various architectures, or first guide program transformation) would be unsuitable.

(Correction 07:40 UTC) See text.

(Correction 2009-03-31) Well, both would be underspecifications, actually. So then what am I supposed to be underspecifying?



The effects of calling external non-C++ code must be defined by the platform which provides such means. This is not the responsibility of the C++ standard; it defines how a C++ program behaves. Simple.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
I don't see why reasoning in terms of conceptual fences or whatever they might be called (that can be conservatively implemented on various architectures, or first guide program transformation) would be unsuitable.



Ok, and how would you define the simple example where one thread makes a store-release and another thread makes a load-acquire in terms of fences? Won't you revert to the terms "happens-before" and "sequenced-before" (or analogs, or just replicate the meaning of these terms everywhere)?

How would you define the following code in terms of fences:

int data;
atomic<int> flag;

// thread 1:
data = 1;
flag.store(1, release);

// thread 2:
if (flag.load(relaxed))
    flag.store(2, relaxed);

// thread 3:
if (flag.load(acquire) == 2)
    assert(data == 1);

?

Please provide a sketch of definitions "based on fences". Probably I am missing something - how can this be defined much more simply in terms of fences?

RafSchietekat
Valued Contributor III
#25 "Ok, and how you would define simple example where one thread makes store-release and another thread makes load-acquire in terms of fences? Will not you revert to terms "happens-before" and "sequenced-before" (or analogs, or just replicating meaning of these terms everywhere)?"
To describe the same facts, two descriptions are not going to be so radically different that they do not use common elements, of course. I'm only criticising that the proposed specification seems more suitable for automatic tools and egghead theoreticians (only meant affectionately, of course) than real-life programmers, once a proud race of self-reliant people, now reduced to pitiful shadows of themselves, unsure whether the code they once wrote and confidently used in production still passes muster with the new standard (one tiny little benign race somewhere in a corner and the whole program nevertheless comes to a screeching halt, but only if you're lucky!), and getting hooked like a drug addict on that nifty data-flow analysis tool where the intention is to interpret the clues to make changes in the program until one day hopefully the race-free status icon turns green.

"How would you define following code in terms of fences:

[...]

Please, provide sketch of definitions "based on fences". Probably I am missing something, how this can be defined much simplier in terms of fences?"
I plead guilty to newspeak with regard to "fences", and I will try to avoid using the word for a while.

I'm not sure, because I don't quite understand the specification, but I would feel a lot less disempowered if I could reason along the lines that thread 1 conceptually writes flag={release(data),1}, then thread 2 writes flag={relaxed,2}, and then thread 3 fails to synchronise because it sees the wrong label (so the assert fails). Would "consume" semantics in thread 2 save the day, maybe, and what exactly would be the toll in performance?
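If I read the draft correctly (an assumption on my part, to be checked against the final wording), the damage is done by thread 2 performing a separate load and store: a read-modify-write in thread 2 would continue the "release sequence" headed by thread 1's store, even with relaxed ordering, and thread 3's acquire would then synchronise with thread 1's release. A sketch:

#include <atomic>
#include <cassert>

int data;
std::atomic<int> flag(0);

void thread1() {
    data = 1;
    flag.store(1, std::memory_order_release);
}

void thread2() {  // RMW instead of separate load+store
    int expected = 1;
    flag.compare_exchange_strong(expected, 2,
        std::memory_order_relaxed, std::memory_order_relaxed);
}

void thread3() {
    if (flag.load(std::memory_order_acquire) == 2)
        assert(data == 1);  // now guaranteed: the RMW keeps thread 1's
}                           // release sequence intact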

How much more satisfying would it also be if programmers were at least deemed worthy to know about the mechanics of it all, so that there is at least hope of taking control of our own destiny beyond the confines of the virtual reality presented to us. What does it really mean to perform a release or an acquire, and exactly what goes wrong when thread 2 intervenes (perhaps running on a core close to the one running thread 1, seeing a preview of the update on flag and being able to interfere with the cache coherence protocol in such a way that it can get its new value of flag out in front of the data released by thread 1)? What hardware works that way, what do the protocols look like, in principle? How would a "consume" be implemented to restore causal reasoning? Now there's just formal semantics with no link to anything tangible, so either you join a cult of people who like nothing more than to recite abstract incantations, or you just accept that the world you thought you knew has changed into something unrecognisable and incomprehensible, where you're like the farmer who can no longer collect his own seeds for the next season but has to buy them from the GMO oligopolists. (Too much? I find it wonderfully therapeutic, though...)

Multithreading is of course a ridiculously flawed approach to parallelism, because so much can go wrong. It would be far better to have solutions that free us from having to consider all possible interactions, or at least to only use conduits that have limited scope, like queues that take only self-contained elements from here to there like science-fiction wormholes between universes, not to mention entirely new paradigms like functional languages, free of user-visible side effects. Maybe metadata could be elevated in the language to be able to express not only invariants, but what synchronisation mechanisms are used with each specific variable, instead of considering either no state transfer (relaxed) or full-state transfer (release/acquire), and I wouldn't be surprised if this could even be used to improve performance.

But meanwhile we have C++, and we have to find a way to make its users actually understand it again.
RafSchietekat
Valued Contributor III
#26 "But meanwhile we have C++, and we have to find a way to make its users actually understand it again."
My advice to the committee: don't spring this on the hoi polloi after your next druid convention yet (note how I'm epically bridging locations if not eras), first find something that resonates with the masses, not just with the vendors of design tools that will do heavy-duty program flow graph crunching because the human element is not in control anymore. Take some lessons from moribund CORBA: at least get sufficient experience with a reference implementation (maybe "sequentially consistent" is just a red herring?), and only then make it a standard. Does it need to be like what in my country would be called a "program law" (everything and the kitchen sink in one big text, rendering the parliament powerless, also with regard to individual items that the government wants to smuggle through because they otherwise wouldn't pass)? Do a TR for multithreading (except for no-brainers like legislating memory locations), and instead give us labeled loops already (like in Java), and try/finally (synchronous destruction of local objects doesn't work, unless with the significant clutter of a friend declaration and member references to function-local variables, just to be able to do, e.g., exit tracing and postcondition checking (cleanup would remain c/o RAII), which now forces single-return programming guidelines upon innocent cubicle dwellers all over the world, even though it's not even exception-safe).
RafSchietekat
Valued Contributor III
#27 "maybe "sequentially consistent" is just a red herring?"
One of the correct choices TBB made, I think, is to give tbb::atomic release/acquire semantics. I extended the range of possibilities, but I didn't keep release/acquire by default merely for backward compatibility. I think that this default atomic is sort of like the natural dual of a lock, which also has acquire/release semantics, except the other way around: a lock is active holding a mutex, with interesting stuff happening between acquire and release, while an atomic would be "active" tunnelling state from one end (before the release-store) to the other (after the load-acquire), after which time you can safely lose interest. No seq_cst anywhere in sight, and nobody is missing it (well, I'm not).

I also decided to make atomic configurable with a default-semantics policy (which unfortunately required a prefix instead of a template argument, because of technical reasons related to C++). It is far safer to reason about an atomic like that. So now an atomic has a default policy rel_acq, which implies default release for a store, acquire for a load, and rel_acq when store and load are combined. You can override the default policy, and a relaxed_atomic would be relaxed about all its operations, while an ordered_atomic would make all its operations ordered (note that mut_ord will probably be the equivalent of C++0x's seq_cst, while ordered, somewhat heavier, will be sequentially consistent with any counterparty). Even though it remains possible, normally you would not override an operation's policy-directed default semantics, which eliminates a lot of room for mistakes, takes away some of the clutter, and makes it far easier to reason about what an atomic is doing.
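Not my actual implementation (which uses prefixed type names precisely because a template argument ran into technical problems), but purely as illustration, the policy idea could be sketched with C++0x atomics like this:

#include <atomic>

struct rel_acq {        // default policy: release stores, acquire loads
    static const std::memory_order store_order = std::memory_order_release;
    static const std::memory_order load_order  = std::memory_order_acquire;
};

struct relaxed_policy { // opt-in policy: no ordering anywhere
    static const std::memory_order store_order = std::memory_order_relaxed;
    static const std::memory_order load_order  = std::memory_order_relaxed;
};

template<typename T, typename Policy = rel_acq>
class policy_atomic {
    std::atomic<T> v;
public:
    void store(T x) { v.store(x, Policy::store_order); }   // policy-directed
    T load() const  { return v.load(Policy::load_order); } // defaults
};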

The example overrides what would normally be a default-policy (rel_acq_)atomic with some explicitly relaxed operations, which means that the user is alerted that something not altogether safe is going on and would not easily be fooled.

Together with my other criticisms against the current proposal for atomics, I think this demonstrates that we may not be getting the best deal if the current proposal goes to print as-is.
RafSchietekat
Valued Contributor III
Before I possibly continue my comments about "Foundations of the C++ Concurrency Memory Model" (#6), here are some preliminary observations about another text that references it, and that was also recommended to me: "Thread Basics", "a much more introductory paper [that] explains the underlying philosophy".

About the introduction:

"For present purposes hardware threads are similar enough to multiple cores, which are similar enough to multiple processors on separate chips, that we don't make the distinction."
And yet the distinction is so very important to any decision on whether to relax Independent-Reads-Independent-Writes sequential consistency, as discussed in the previous article. I always get suspicious when too much reality seems to get sacrificed for an abstraction.

"Even "experts" have often advocated contradictory approaches."
Without examples or even just references, that's somewhat of a cheap shot.

"To us, this is an easier model to explain. And programs built to be correct under this model will execute correctly on a genuinely parallel implementation. (Unfortunately, programs that are incorrect under this model may fail in much more mysterious ways on multicore implementations.)"
So it's a simplification of what can happen, but "correct" programs under the simplification are also "correct" in the unrestricted case? How can that be? Where is it explained? Isn't it deceptive to first only look at a simplified situation?

About "The problem with pure sequential consistency":

"And, as we will see below, insisting on pure sequential consistency generally does not provide us with any real advantage."
Whatever does that mean?

About "Data races and a more restrictive model":

"We then guarantee sequential consistency only when the program avoids data races."
That's certainly a lot nicer than saying that the presence of a data race makes for "undefined" program behaviour. But what does it mean? That a data race is like a catastrophe between sequential consistency and executions that are not simply interleaved? I'm rather underwhelmed at this point, but maybe I missed something? Also, how valuable is this result supposed to be, and why?

About "Why we haven't lost anything important":

"At first glance, the restriction of sequential consistency to data-race-free programs seems like an unnecessary complication for multithreaded programming, which is certainly already a sufficiently difficult problem. However, there are other important reasons to discourage data races. In fact, this restriction affects almost exclusively programs that we genuinely want to discourage."
This should be interesting... but after reading the section I'm not satisfied: I'm looking for the crucial difference between data races that make the program misbehave and ones that don't. Sure, it is generally a good thing to avoid them altogether, but why should they be so uniformly disastrous?

"Even if it is a synchronization variable, the increment may consist of two atomic accesses to c as above, instead of a single one to both read and write c, thus allowing the same interleaving as above."
This makes no sense: why wouldn't a "synchronization variable" have an atomic increment? A Java volatile is a strange creature, but I see no reason to replicate its quirks elsewhere.
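In C++0x terms, the difference the paper is pointing at would be (my sketch): one indivisible read-modify-write versus two separate atomic accesses that another thread can slip between.

#include <atomic>

std::atomic<int> c(0);

void increment_atomically() {
    c.fetch_add(1);              // one indivisible RMW: no lost updates
}

void increment_in_two_steps() {  // what a Java-volatile c++ amounts to
    int tmp = c.load();          // another increment can land here...
    c.store(tmp + 1);            // ...and be overwritten: a lost update
}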

"Programs with data races almost never produce consistently correct results across various hardware and compiler platforms. (The literature sometimes talks about "benign data races". Most of them are not benign in this sense.) Thus it makes little sense to sacrifice performance to accommodate such programs that are almost certainly incorrect."
Now that's devious: I feel that "undefined behaviour" is right around the corner. And who is talking about sacrificing performance: the example in the first article reviewed here, by the same author, is about better performance c/o a benign data race!

About "Avoiding data races at higher levels of abstraction":

"In the general case, it is the library's responsibility to specify what combinations of concurrent calls will introduce data races."
It's funny you should say that, because N2800 doesn't even specify whether std::vector::operator[] is safe, let alone concurrent use of the reference so obtained.

About "A few Java specifics":

"Hence it cannot allow arbitrary behavior for data races."
That's nice. So will the compiler reject the program because it cannot prove that it is race-free? No, apparently "very sophisticated authors" may still use them. So why shouldn't this be allowed in C++ as well, without rendering behaviour of the entire program "undefined" (which means that one would be lucky if the program just crashes)?

About "A few C++0x specifics":

"Following established convention, we will use the term C++0x to refer to the next C++ standard, in spite of the fact that we do not expect it to be formally adopted as an ISO standard until 2010 or 2011."
We can only hope that it won't be rushed out in 2009...

"Since C++ is not designed to provide protection against untrusted code, it guarantees nothing in the event of a data race. Any program allowing a data race produces "undefined behavior"."
And that's crazy: why the radicalism, why shouldn't there be intermediate guarantees and prohibitions that allow the programmer to take responsibility? Why wouldn't any damage be local to an affected variable?

"In the absence of data races, and provided that certain low-level library facilities are not used (see below) it again provides sequential consistency. Also like Java, this is not immediately apparent from the official language description. In this case the language description is made more complicated not by the treatment of data races, but by those low-level library facilities."
Again, why is "sequential consistency" so all-important that it should have such a radical impact, when in fact it does both too much (often standing in the way of performance with, e.g., reference counting) and too little (see "Avoiding data races at higher levels of abstraction") to be a panacea? Or am I confused by the use of seq_cst for what is decreed to be the default mode for atomics (although it seems hard to imagine that whoever chose this name didn't want to bundle it all up)? It does sound nice (who would want to be "inconsistent"), but experience has taught that it comes with a severe performance penalty.
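Reference counting is the standard illustration of seq_cst doing too much (a well-known pattern, sketched here): the decrement needs only release, with an acquire before destruction, and a sequentially consistent fetch_sub would add cost without adding any guarantee that this code relies on.

#include <atomic>

struct RefCounted {
    std::atomic<int> refs;
    void release_ref() {
        if (refs.fetch_sub(1, std::memory_order_release) == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            delete this;  // safe: all prior uses are ordered before the delete
        }
    }
};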

"std::lock_guard _(mtx);"
Is that really a recommendable "name"? Wouldn't it be more self-documenting to name it "anonymous", and less dangerous (losing that tiny little underscore breaks the protection)?

"It currently appears that the C++0x model will be followed in some other environments as well."
Bush was once re-elected. And languages have to last longer than U.S.A. presidents.

About "Some more examples":

"Hence a compiler is not allowed to transform the code as we have suggested here, in spite of the fact that hardware conditional move instructions often make the last version faster, or it might allow a containing loop to be vectorized."
I had not seen this example before, but it seems rather deceptive: first the code is deemed not to have a data race based on a specific initial state, because the model wants to condemn all data races and therefore has to have a minimal concept of data race to be at all acceptable, and then the optimised code supposedly introduces a race because the specific initial state is not taken into consideration to optimise away all code in Thread 1. While at first it seems satisfying that the compiler is also punished, it's still the user who pays (with missed performance). So what does he get in return, really? Similarly for the loop example that follows, where the performance penalty is even greater. I'm leaning toward allowing the compiler to perform the optimisations, but this certainly merits further study (how about an example from a real program?).
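The transformation at issue, as I understand it (my sketch of the paper's example):

int x, y;                    // ordinary shared variables

void as_written() {
    if (x == 1) y = 1;       // stores to y only when x == 1
}

void transformed() {         // the conditional-move rewrite
    y = (x == 1) ? 1 : y;    // always stores y, even when x != 1,
}                            // racing with concurrent writers of y that
                             // the original program never conflicted with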

About "A note on implementation of synchronization operations (optional)":

"It is included here only in the hope that we might help reconcile the presentation in the rest of this paper with the reader's preconceptions of the subject."
Much appreciated...

"(Architectures that do not require such alignment usually cannot guarantee this kind of atomicity, since a single load or store instruction may require multiple accesses to physical memory.)"
That's confusing: it seems to imply that a processor couldn't guarantee indivisible aligned accesses if it also allowed unaligned accesses, which I believe several do (at a performance penalty). But I trust it's just unfortunate wording.

"A load instruction is said to become visible at the point at which it stops being affected by later stores."
Now what could that possibly mean... in English? This also means that the meaning of the next bullet point is unclear, and the one after that an apparent contradiction (why would an architecture whose stores become visible in a single total order need a fence to order stores?).

"This description is far too sloppy to serve as the basis of a real implementation"
I'll say!

"The fence preceding the assignment to the synchronization variable x_init prevents this reordering."
Now I'm very confused, what with the earlier statement "stores become visible in a single total order". Or am I just biased by the SPARC hierarchy (TSO, PSO, RMO)?
"The fence following the x_init is not necessary in this example, but is required in the general case to prevent reordering of a synchronization store with a subsequent synchronization load."
But very likely we're not interested in preventing such reordering, only in release semantics. The second fence is not the same beast at all: a #StoreLoad (in SPARC parlance) is far more expensive than a #StoreStore. In what world would this not be the case, and how doesn't that make the example mostly irrelevant? (At this point it would be very interesting to explain how fences are implemented in hardware.)
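In SPARC terms the asymmetry is stark (my sketch; the mapping is the conventional one, not taken from the paper): a release-style publication needs only the cheap ordering before the flag store, while the trailing fence for sequential consistency is the expensive #StoreLoad.

#include <atomic>

int data;
std::atomic<int> x_init(0);

void publish_release_only() {
    data = 42;
    std::atomic_thread_fence(std::memory_order_release);  // roughly membar
    x_init.store(1, std::memory_order_relaxed);           // #LoadStore|#StoreStore
}

void publish_seq_cst() {
    data = 42;
    x_init.store(1);  // default seq_cst: additionally implies the costly
}                     // #StoreLoad-style full barrier after the store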

"(Real implementations would typically avoid the repeated xchg operations while waiting for a lock, since they create large amounts of unnecessary communication between processor caches.)"
Is this about backoff strategies?

"Note that lock()/unlock() operations only require one fence each, not two like a synchronization store operation. This relies on an argument that the difference is not observable in the absence of data races."
I only have difficulty understanding why this might need any consideration at all. To me sequential consistency is a mostly unnecessary and costly burden, and memory_order_seq_cst as a default likely a mistake.
RafSchietekat
Valued Contributor III
When standards people say it is undefinable, they sometimes really mean I don't understand it. - Paul E. McKenney

I stumbled upon a slide presentation (dated 2008-02-01) by a certain Paul E. McKenney (IBM Distinguished Engineer). He apparently also doesn't get it (the sequential consistency thing), but then he's only a lowly practitioner ("And I have been doing parallel C for about 17 years"), not an exalted academic ("People listen to academics, even when we practitioners think that they shouldn't :-)"). He also seems to think that it's now too late to do anything about it ("But starting in 2005 might have produced an alternative to sequential consistency"). Sigh...
RafSchietekat
Valued Contributor III
Apparently some people (one of them the aforementioned Paul E. McKenney, Ph.D.) actually think it's a good thing to be... inconsistent (sounds awful, doesn't it?), and they're calling it "Relativistic Programming". I'm not sure that the opposite of something that I don't entirely trust is necessarily better, but it now seems worth having a look.

Here's my intuition: People are ill equipped to track the kind of dependencies that are involved in the make-up of a sequentially consistent program as C++0x would have it (we're really not even very good at following a single train of thought). We need not just crutches, but heavy machinery to help us there, and if the alternative is the hell and damnation of "undefined" program behaviour due to a single little innocent mistake, that's not what I would consider a happy place. Moreover, computers don't like it, and they're telling us by dragging their feet when running programs that require it.

So why exactly are we being coerced in the direction of "sequential consistency" again? Even now, on the brink of ratification, I have yet to see anything to convince me that this is it. Maybe after everything is cast in stone, somebody will condescend to our level and give a satisfactory explanation for the rest of us, but I'm not holding my breath, because I think that the simplicity of sequential consistency is merely deceptive. And of course then it may well be too late.

Instead, I'd rather drop this heavy burden, and use an environment that better serves its purpose as the interface between my human mind and the computer's mechanical brain, where brittleness should be mitigated rather than exacerbated. Let me worry about a manageable scope only, and then hand it over by doing a release on a baton variable that somebody else will then use to acquire my state, as in a relay run. If I ever care about a common timeline (which won't be very often), I'll use a dedicated device, but otherwise I shouldn't have to be bothered about it.

I may well be mistaken, but now I know that this wouldn't be just because I don't have a Ph.D. to my name (although I again admit that this is just my intuition, so you should make up your own mind independently).