My bet: if you take care to put the mutex in the same cache line as the data you want to protect, the only significant penalty you pay relative to an atomic operation is the risk of convoying, but I have nothing to back that up (no pun intended). I see no scalability problem (if used with my super duper spin lock algorithm), and if the protected sections are short enough, STM doesn't buy you any performance either (only ease of use). Just make your protected sections short enough by, e.g., taking a snapshot, doing any expensive operations outside the lock, then taking the lock, making sure there are no ABA issues, and completing the "transaction" ("optimistic locking"?). Again, all of this is completely speculative: please correct any misconceptions. You (and I) should probably also look into the optimisations that can be had from techniques for mostly-read data, where it still pays to incur a huge penalty on writes because so many more reads become so much faster (as I understand it); I'm sure Dmitriy Vyukov can provide some pointers (again).
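A minimal sketch of that optimistic pattern, under my own assumptions (the `Account`, `version` counter, and `add_interest` names are all made up for illustration, not from any library; a real design would also need a retry loop and care about the counter wrapping):

```cpp
#include <atomic>
#include <mutex>

// Hypothetical optimistic "transaction": snapshot the state, do the
// expensive work outside the lock, then take the lock only long enough
// to check that nothing changed (the version counter guards against ABA)
// and commit.
struct Account {
    std::mutex m;                      // ideally in the same cache line as the data
    std::atomic<unsigned> version{0};  // bumped on every committed write
    std::atomic<long> balance{0};
};

// Returns true if the optimistic update committed, false if another
// writer got in first (caller may retry).
bool add_interest(Account& a, double rate) {
    unsigned v = a.version.load(std::memory_order_acquire);   // snapshot
    long snap = a.balance.load(std::memory_order_relaxed);
    long delta = static_cast<long>(snap * rate);              // "expensive" work, unlocked
    std::lock_guard<std::mutex> g(a.m);                       // short protected section
    if (a.version.load(std::memory_order_relaxed) != v)
        return false;                                         // someone else committed; abort
    a.balance.store(snap + delta, std::memory_order_relaxed);
    a.version.fetch_add(1, std::memory_order_release);        // publish the commit
    return true;
}
```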
#4 "Precise representation of Java's volatile is atomic variable with sequentially consistent operations on it."
How do you figure that? Aren't they only sequentially consistent among themselves (rhetorical)? Anyway, that's what I modelled with the memory semantics "mut_ord" (for "mutually ordered"), which is less strict and, because it wouldn't make sense otherwise, maybe a little bit faster than "ordered" (I didn't like "seq_cst" for "sequentially consistent" because you never quite know what it means anyway, as this discussion shows), although not on all architectures (on x86 I don't recall that it makes any difference).
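For what it's worth, the usual litmus test for what "sequentially consistent among themselves" buys you is store buffering: with C++0x-style `memory_order_seq_cst` on both atomics, the outcome where each thread misses the other's store (r1 == 0 && r2 == 0) is forbidden by the standard, whereas weaker orderings allow it. A small probe (my own harness, assuming `std::thread`):

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test. Each thread stores to one variable and
// then loads the other; under seq_cst there must be a single total order
// of the four operations, so at least one thread observes the other's
// store, and r1 == 0 && r2 == 0 can never occur.
bool sb_forbidden_outcome_seen(int iterations) {
    bool seen = false;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) seen = true;  // forbidden under seq_cst
    }
    return seen;
}
```

On x86 the only extra cost of seq_cst here is the fence (or locked instruction) after the stores; the loads are ordered anyway, which matches the remark that the difference is hard to notice there.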
#5 "I can't follow you here. volatile variables are limited in size just as atomics (otherwise they are also implemented with something like mutex). So what's the problem?"
I'm not sure it was implied that Java volatile applies to entire objects, but I do recall some discussions about applying atomic to entire classes etc. I'd have to check what C++0x is planning to do there.
#6 "There is just no scalable centralized shared data-structures. Period."
Admitted, but there are still degrees of scalability: some mechanisms quickly degrade through convoying (locks), others through livelock or something very similar to it (bad locks, bad lock-free algorithms).
On a related point, what do you think: will Rock rock?