Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Spin-Locks vs. Atomic Instructions

AJ13
New Contributor I
3,822 Views
Hello all,

I have been implementing some algorithms from "The Art of Multiprocessor Programming" in TBB. I have found in this book, however, that it relies sometimes upon the Java volatile keyword. The best way to mimick this functionality in TBB, it seems, is to have a spin-lock wrapped around the variable that would have been volatile in Java.

I have found that the limited size of atomic variables is a problem with even simple algorithms. For instance, if I want to atomically alter two variables I need to pack them into a single atomic. What I have done so far, is use spin-locks around the variables that I would normally want to alter atomically. How well would this scale as the number of threads/cores increases?

Is software transactional memory worth looking at, performance wise? Can it really offer any better performance than spin-locks?

Do I have other alternatives to atomic/spin-locks in writing fast parallel algorithms?

Thanks,

AJ
0 Kudos
35 Replies
RafSchietekat
Valued Contributor III
2,893 Views

My bet: if you take care to put the mutex in the same cache line as what you want to protect, the only significant penalty you pay relative to an atomic operation is the risk of convoying, but I have nothing to back that up (no pun intended). I see no problem of scalability (if used with my super duper spin lock algorithm), and if the protected sections are short enough, STM doesn't buy you any performance either (only ease of use). Just make your protected sections short enough by, e.g., taking a snapshot, doing any expensive operations, getting a lock, making sure there are no ABA issues, and completing the "transaction" ("optimistic locking"?). Again, like those expensive operations, all this is completely speculative: please correct any misconceptions. You (and I) should probably also look into the optimisations that can be had from techniques involving mostly-read data, where it still pays to incur a huge penalty for writes because so many more reads were so much faster (as I understand it); I'm sure Dmitriy Vyukov can provide some pointers (again).

0 Kudos
robert_jay_gould
Beginner
2,893 Views
This reminds me of my dilemma with Atomic floats, the information there might be of use for you too, AJ. Anyways I also dreamed of a near future of transactional memory architectures, but as Arch pointed out in that thread(with some articles), transactional memory is still in its infancy and likely to be years down the road before it becomes really useful when performance matters. If you just need a little boost, or scalable system its probably ok, but not if your trying to juice the machine.

Anyways my tests of random access to arrays of spinlocked values and atomic values, proved spinlocks were better by a small margin, but then again as Raf mentions I had them on the same cache line as the values.


0 Kudos
AJ13
New Contributor I
2,893 Views
I actually found that the data structure I wrapped with a spin_lock performed very well. No official benchmarks though.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - AJ
I have been implementing some algorithms from "The Art of Multiprocessor Programming" in TBB. I have found in this book, however, that it relies sometimes upon the Java volatile keyword. The best way to mimick this functionality in TBB, it seems, is to have a spin-lock wrapped around the variable that would have been volatile in Java.


Precise representation of Java's volatile is atomic variable with sequentially consistent operations on it.
However in many cases an algorithm may not require sequential consistency, but Java just don't have more accurate instruments than sequentially consistent volatiles (shame on it). So if you are using TBB atomics you may replace sequentially consistent operations with acquire/release operations in such cases.
For example, on x86 sequentially consistent store is LOCK XCHG, however store-release is just plain MOV.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - AJ
I have found that the limited size of atomic variables is a problem with even simple algorithms. For instance, if I want to atomically alter two variables I need to pack them into a single atomic. What I have done so far, is use spin-locks around the variables that I would normally want to alter atomically. How well would this scale as the number of threads/cores increases?


I can't follow you here. volatile variables are limited in size just as atomics (otherwise they are also implemented with something like mutex). So what's the problem?

Note that:

volatile struct X
{
int m;
int n;
};

is no more than:

struct X
{
volatile int m;
volatile int n;
};

However I'm not sure that Java has structs and allows to declare them as volatile, so just ignore this part.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - AJ
What I have done so far, is use spin-locks around the variables that I would normally want to alter atomically. How well would this scale as the number of threads/cores increases?


If the data structure is under light load then it doesn't matter.
If the data structure is under heavy load then - no, spin-mutex is not scalable just as atomic.
There is just no scalable centralized shared data-structures. Period.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - AJ
Is software transactional memory worth looking at, performance wise? Can it really offer any better performance than spin-locks?


Nope. IMVHO, STM is the slowest of all techniques. If you are frequently mutating data structure then STM will offer horrible perfromance and scalability to you. If you are mostly reading data struture then some STM implementations will offer reasonable scalability by using something like SeqLock technique. But still there are techniques that can handle read mostly workloads much better.



0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - AJ
Do I have other alternatives to atomic/spin-locks in writing fast parallel algorithms?


You looking to the wrong direction. Atomic operations may only somehow improve bad non-scalable design.
If you need scalability (scalability not like in STM - "Our STM offers good scalability... well, we don't say it to you, but on 16 cores our STM doesn't even close to single-threaded implementation...", but real scalability - 16 times faster on 16 cores) then you better look at the things like privatization, partitioning, immutability, distribution, amortization, copy-on-write, partial-copy-on-write, lock-free reader pattern, etc.
Consider following benchmarks:
http://software.intel.com/en-us/forums/showthread.php?t=60494
Spin-mutex implementation shows scalability of 0.63 even on 2 cores (linear scalability would be 2.0), even provided that hash-map is quite distributed data structure in itself. If you will measure some inherently centralized data structure performance degradation will be much more severe.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - Dmitriy Vyukov
If you need scalability (scalability not like in STM - "Our STM offers good scalability... well, we don't say it to you, but on 16 cores our STM doesn't even close to single-threaded implementation...", but real scalability - 16 times faster on 16 cores) t


Regarding STM. Consider Deuce-STM:
http://sites.google.com/site/deucestm/Home
(notice "...lets you develop highly concurrent applications..." on the main page).
Now consider benchmark results:
http://sites.google.com/site/deucestm/documentation/benchmarks-1
On fig. 2 you may see than lock-based implementation yields 3.5*10^6 ops/sec on 1 thread, let's assume that purely single-threaded implementation w/o lock would yield 5*10^6 ops/sec. So linear scalability on 16 cores would be 80*10^6. You may see that STM based implementation on 16 cores yields only 2.5*10^6 ops/sec.
Fig. 1 is even worse. Notice how they stretch Y-axis so that STM looks somehow scalable. LOL.
STM benchmarkers ought to draw on the graphs the line that represents abstract linear scalability of the purely single-threaded implementation (w/o locks) to show feebleness of the STM. LOL.
Note that TL2 STM algorithms is quite state of the art.

0 Kudos
RafSchietekat
Valued Contributor III
2,893 Views

#4 "Precise representation of Java's volatile is atomic variable with sequentially consistent operations on it."
How do you figure that? Aren't the only sequentially consistent among themselves (rhetorical)? Anyway, that's what I modelled with memory semantics "mut_ord" (for "mutually ordered"), which is less strict and, because it wouldn't make sense otherwise, maybe a little bit faster than "ordered" (I didn't like "seq_cst" for "sequentially consistent" because you never quite know what it means either anyway, as this discussion shows), although not on all architectures (no x86 I don't recall that it makes any difference).

#5 "I can't follow you here. volatile variables are limited in size just as atomics (otherwise they are also implemented with something like mutex). So what's the problem?"
I'm not sure it was implied that Java volatile applies to entire objects, but I do recall some discussions about applying atomic to entire classes etc. I'd have to check what C++0x is planning to do there.

#6 "There is just no scalable centralized shared data-structures. Period."
Admitted, but there are still degrees of scalability: some mechanisms quickly degrade through convoying (locks) and livelock or something very similar to it (bad locks, bad lock-free algorithms).

On a related point, what do you think: will Rock rock?

0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - Raf Schietekat

#4 "Precise representation of Java's volatile is atomic variable with sequentially consistent operations on it."
How do you figure that? Aren't the only sequentially consistent among themselves (rhetorical)? Anyway, that's what I modelled with memory semantics "mut_ord" (for "mutually ordered"), which is less strict and, because it wouldn't make sense otherwise, maybe a little bit faster than "ordered" (I didn't like "seq_cst" for "sequentially consistent" because you never quite know what it means either anyway, as this discussion shows), although not on all architectures (no x86 I don't recall that it makes any difference).


I figured that from language spec. They order all memory accesses.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - Raf Schietekat

On a related point, what do you think: will Rock rock?


I can't say anything regarding Rock itself, but I think HTM feature will rock amongst hardcore low level programmers for whom it's worth doing optimizing for one particular Sun processor. I think that the main consumers will be Solaris and Sun JVM.
HTM is much better than STM because it allows overheadless perfectly scalable read-only transactions. Although here are some caveats too (at least for unmanaged world), for example, if one reads object from concurrent container, he must either include object processing into transaction or use something like reference counting, both options are not very good.
Mutable operations are also handled better to some degree. But because of it's limitations, I think that the reasonable usage is to construct double-CAS or triple-CAS from HTM, and then work in terms on these CASes.
But for now it's more of a theoretical interest for me, I am curious when AMD will finally implement HTM.

0 Kudos
RafSchietekat
Valued Contributor III
2,893 Views
#11 "I figured that from language spec. They order all memory accesses."
I'm afraid that this is just in your imagination, although I'm always eager to be educated. :-)

(Added) I'm prepared to assume that you know internally what you're talking about, but then you should add some more knowledge barriers so that we don't have to reconstruct what you know out of thin air (sorry, couldn't help myself). My take is that Java volatiles are only sequentially consistent with regard to other volatiles (not quite what you wrote); with regard to non-volatiles, they basically only do load-acquire or release-store. This is what I tried to mimick with mut_ord memory semantics (mutually ordered with other such atomics). The implication is that if you have an algorithm in C++ with ordered (C++0x's seq_cst) atomics, it is not enough (in general) to substitute Java volatile for just those atomics (doesn't RRD model them as weaker?); the other way around, substituting C++ ordered atomics for Java volatiles is more expensive than strictly needed (although I'm still assuming that the difference may be relatively minor). Am I mistaken?
0 Kudos
RafSchietekat
Valued Contributor III
2,893 Views
#0 "The best way to mimick this functionality in TBB, it seems, is to have a spin-lock wrapped around the variable that would have been volatile in Java."
If all "volatiles" that may interact share the same mutex, that would be sufficient, but more expensive than necessary, even disregarding the possibility that Java volatile is overkill. My patch's mut_ord_atomic is meant to be a closer match, although something more appropriate to the algorithm's needs can probably be found instead; feel free to share any benchmarking outcomes.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - Raf Schietekat
#11 "I figured that from language spec. They order all memory accesses."
I'm afraid that this is just in your imagination, although I'm always eager to be educated. :-)

(Added) I'm prepared to assume that you know internally what you're talking about, but then you should add some more knowledge barriers so that we don't have to reconstruct what you know out of thin air (sorry, couldn't help myself). My take is that Java volatiles are only sequentially consistent with regard to other volatiles (not quite what you wrote); with regard to non-volatiles, they basically only do load-acquire or release-store. This is what I tried to mimick with mut_ord memory semantics (mutually ordered with other such atomics). The implication is that if you have an algorithm in C++ with ordered (C++0x's seq_cst) atomics, it is not enough (in general) to substitute Java volatile for just those atomics (doesn't RRD model them as weaker?); the other way around, substituting C++ ordered atomics for Java volatiles is more expensive than strictly needed (although I'm still assuming that the difference may be relatively minor). Am I mistaken?

Accesses to java volatiles (along with other synchronization actions - mutex lock/unlock, thread start/end, etc) form total global order S of all synchronization actions (which agrees with program order). Any two actions of the same thread (including plain memory accesses) always connected by happens-before relation (according to program order).
This is what I was trying to say. Probably my English lets me down.
And I am still state that equivalent of java volatiles is atomic var with seq_cst operations on it.
I better just paste spec:


17.4.4 Synchronization Order
Every execution has a synchronization order. A synchronization order is a
total order over all of the synchronization actions of an execution. For each thread
t, the synchronization order of the synchronization actions (17.4.2) in t is consistent
with the program order (17.4.3) of t.
Synchronization actions induce the synchronized-with relation on actions...


17.4.5 Happens-before Order
Two actions can be ordered by a happens-before relationship. If one action happens-
before another, then the first is visible to and ordered before the second.
If we have two actions x and y, we write hb(x, y) to indicate that x happensbefore
y.
If x and y are actions of the same thread and x comes before y in program
order, then hb(x, y).
There is a happens-before edge from the end of a constructor of an object to
the start of a finalizer (12.6) for that object.
If an action x synchronizes-with a following action y, then we also have hb(x,
y).
If hb(x, y) and hb(y, z), then hb(x, z).
...


17.4.2 Actions
...
Synchronization actions, which are:
Volatile read. A volatile read of a variable.
Volatile write. A volatile write of a variable.
...


This is what I was trying to say.
0 Kudos
RafSchietekat
Valued Contributor III
2,893 Views
Let's narrow it down: do you agree that a normal load before a Java volatile load or a normal store after a Java volatile store can be reordered, and that this is not allowed with C++0x seq_cst, which would make them not equivalent? If you do not agree, why not?

(Added) Or maybe they are equivalent and my patch's "ordered" is stronger than C++0x seq_cst?
0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - Raf Schietekat
Let's narrow it down: do you agree that a normal load before a Java volatile load or a normal store after a Java volatile store can be reordered, and that this is not allowed with C++0x seq_cst, which would make them not equivalent? If you do not agree, why not?

(Added) Or maybe they are equivalent and my patch's "ordered" is stronger than C++0x seq_cst?

I do not agree.

17.4.5 Happens-before Order
...
If x and y are actions of the same thread and x comes before y in program
order, then hb(x, y).
...

If you have normal load A sequenced before volatile load B in program order then A happenes-before B.
If you have normal store A sequenced after volatile store B in program order then B happens-before A.

0 Kudos
RafSchietekat
Valued Contributor III
2,893 Views
"I do not agree."
Well, apparently you didn't always see them as equivalent yourself. I'm just repeating what Doug Lea wrote in "The JSR-133 Cookbook" ("Can Reorder" table, near the top)... I'm pretty sure he has read the same things you are quoting, him being part of JSR 133's Expert Group and all (in case anybody's wondering: this specification was subsequently incorporated into the Java Language Specification). I can see the apparent contradiction, and I probably should dive into the JLS sometime to get to the bottom of this, but right now I have this table, and some vague memory of another text that ascribed total order only to the volatiles among themselves and specifically complained about how expensive this is on x86 without providing the same full-fence guarantees in general so that code would be written that would run only on x86 (in case anybody from Intel is gloating right now: I think Itanium would be affected, too), or something like that (I could have just dreamt it, though).

0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,893 Views
Quoting - Raf Schietekat
"I do not agree."
Well, apparently you didn't always see them as equivalent yourself.

It has nothing to do with seq_cst atomics. It's about stand-alone seq_cst fences which are a bit strange in C++0x.


0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,559 Views
Quoting - Raf Schietekat
I'm just repeating what Doug Lea wrote in "The JSR-133 Cookbook" ("Can Reorder" table, near the top)... I'm pretty sure he has read the same things you are quoting, him being part of JSR 133's Expert Group and all (in case anybody's wondering: this specification was subsequently incorporated into the Java Language Specification). I can see the apparent contradiction, and I probably should dive into the JLS sometime to get to the bottom of this, but right now I have this table, and some vague memory of another text that ascribed total order only to the volatiles among themselves and specifically complained about how expensive this is on x86 without providing the same full-fence guarantees in general so that code would be written that would run only on x86 (in case anybody from Intel is gloating right now: I think Itanium would be affected, too), or something like that (I could have just dreamt it, though).


I think you are mixing up formal semantics and implementation details. Notice the name of the paper - "... for compiler writers". I believe that possible reorderings you mentioned do not affect formal semantics. Consider, C++0x's seq_cst loads are implemented as plain loads on x86, there is no need to issue mfences before and after loads - formal sequential consistency is still provided.

0 Kudos
Reply