Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Local spinning lock not faster

capens__nicolas
New Contributor I
Hi all,

I've done a bit of testing of the Threading Building Blocks library on my Core i7 and found that the queuing_mutex isn't any faster than either a simple exponential backoff lock or a ticket lock. So I wonder why that is, and whether there's any faster alternative out there (in particular for implementing fairly short but contended critical sections).

Is it no faster because it has a higher overhead and the effect of spinning on a local cache line is rather small for 8 threads and only becomes more significant with a higher number of cores? Or does the Core i7's snoop filter already compensate for not spinning on a local cache line for the other locks?

Any explanation and advice for creating a (very) high performance critical section would be much appreciated!

Thanks,

Nicolas
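
[Editor's note] For readers who want a concrete starting point, here is a minimal sketch of the two non-TBB locks mentioned in the question - a ticket lock and a test-and-test-and-set lock with exponential backoff - assuming C++11 atomics and the _mm_pause() intrinsic from <emmintrin.h>. The class names are illustrative, not TBB's API (TBB's queuing_mutex is an MCS-style queue lock in which each waiter spins on its own node).

#include <atomic>
#include <emmintrin.h>  // _mm_pause

// Ticket lock: FIFO-fair, but every waiter spins on the shared now_serving
// word, so each release invalidates that line in all waiters' caches.
class ticket_lock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
public:
    void lock() {
        unsigned my = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != my)
            _mm_pause();
    }
    void unlock() {
        // Only the owner writes now_serving, so a plain increment is safe.
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};

// Test-and-test-and-set lock with bounded exponential backoff: unfair,
// but the backoff thins out the RFO traffic on the lock word.
class backoff_lock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        unsigned delay = 1;
        for (;;) {
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            while (locked.load(std::memory_order_relaxed)) {
                for (unsigned i = 0; i < delay; ++i)
                    _mm_pause();
                if (delay < 1024)
                    delay *= 2;
            }
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};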
Dmitry_Vyukov
Valued Contributor I

"Dmitriy,

Thanks for taking the time and interest in writing this test.

There are three flaws with your test:

1) Very minor flaw - remove the "2" from "thread_func2" (missing function)
2) Medium flaw - the code as written is sensitive to data placement
3) Major flaw - the assumption that _mm_pause is equal for all threads.

I did some experimentation on my system (Q6600, Windows XP x64).

As you can see, using your original struct and _mm_pause, I too experienced the performance difference in x32 code.

Running x64 code on the same struct with _mm_pause was inconsistent with the x32 run (closer to the same run times).

Tweaking the alignment could produce the same run times (x64).

Then I reverted back to the original struct (without __declspec and the difference in padding), but replaced _mm_pause with a small computation.

Now x32 has equal run times for both tests, and x64 has equal run times for both tests."



Flaw 1. Yes, it must be "thread_func". I was just doing some more tests with that code.
Flaw 2. What exactly do you mean? I've only separated variables onto different cache lines. I think that's perfectly OK, since we are talking about synchronization primitives.
Flaw 3. I specifically measured times when thread 2 and thread 3 access different variables (second table in my post). The times are exactly equal and stable. So I think _mm_pause() is not the problem.

I believe that the only reason why times are equal in your test is that you have increased local work, i.e. reduced contention.
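
[Editor's note] To make flaw 2 and the "separated onto different cache lines" point concrete, here is a minimal sketch (my own reconstruction, not the actual benchmark code from this thread) using the MSVC-style __declspec(align(64)) mentioned above; the field names are illustrative.

// Each counter is forced onto its own 64-byte cache line, so the
// "threads touch different variables" case really is uncontended.
struct counters {
    __declspec(align(64)) volatile long a;       // incremented only by thread 2
    __declspec(align(64)) volatile long b;       // incremented only by thread 3
    __declspec(align(64)) volatile long shared;  // incremented by both in the contended test
};

// _InterlockedIncrement(&c.a) and _InterlockedIncrement(&c.b) then hit
// different lines; packing the fields together would reintroduce cache-line
// ping-pong (false sharing) even in the nominally uncontended case.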

Dmitry_Vyukov
Valued Contributor I

"Now I did some more testing while reducing the computation time of the dummy compute section (I simply removed the sqrt). Now the times for test1 and test2 are skewed again.

I think the true answer lies deep enough in the internals of the cache sub-system (on this processor) that we cannot get a definitive answer from running this simple test."

Indeed!
3 situations are possible:
(1) Contention is too low, so that even 4 threads do not saturate interconnects.
(2) Contention is too high, so that even 2 threads saturate interconnects.
(3) Contention is medium, so that 2 threads do not saturate interconnects, but 4 threads do.

I've tuned my test so that it is (3) (or (2) - it's difficult to say).
In your post #16 you simulated (1), thus you got equal times. Now you are simulating (3) (or (2)) too, thus the skewed times.

At least for now my simple mental model explains all the results :)
Dmitry_Vyukov
Valued Contributor I
"It would appear that as you increase the probability of collision for mutex (in this case _InterlockedIncrement) that you do tend to see a skew from test1 to test2, although there are some placement cases that illustrate contrary to this assumption."

What placement cases?


"The original premis of my post was that it is advised to reduce the probability of collision through partitioning means or other programming means."

Even if contention on individual objects is reduced, the total contention (the sum of all contentions) also plays a role (this is what I was trying to show).


"I cannot imagine that that would have an effect on the difference of the skew ratios between x32 and x64. It is likely the slight difference in the loop overhead. And this seems to make a drastic difference (in some cases a ~40% difference)."

The difference in loop overheads affects contention, and contention together with loop overhead determines the run-times. It's a kind of quadratic dependence.

Dmitry_Vyukov
Valued Contributor I

"So when the affinities are set such that the mutex lies within the same cache as the threads sharing the mutex, then we see very good scaling of the access via LOCK xxx.

This is an additional benefit of having the capability of affinity-based scheduling (as is available in QuickThread)."


With such an adjustment I do agree with your statement.

But affinity of threads to cores to caches to data is not so easy to achieve. "Chaotic" threading + locking/atomic reference counting will definitely NOT achieve this.

QuickThread is indeed powerful. It provides means for exactly this - binding all the threads that access a particular mutex/data to one cache (or to one node on a NUMA system).
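
[Editor's note] I won't guess at the QuickThread API here; the sketch below is just a hedged Win32 illustration of the same idea - pinning the two threads that hammer a shared counter onto cores 0 and 1. The assumption that those two cores share one L2 on a Q6600 should be verified with GetLogicalProcessorInformation before relying on it.

#include <windows.h>
#include <process.h>

volatile LONG g_counter = 0;

unsigned __stdcall worker(void*) {
    for (int i = 0; i < 10000000; ++i)
        _InterlockedIncrement(&g_counter);   // the contended LOCK-prefixed access
    return 0;
}

int main() {
    HANDLE t[2];
    for (int i = 0; i < 2; ++i)
        t[i] = (HANDLE)_beginthreadex(0, 0, worker, 0, CREATE_SUSPENDED, 0);

    // Pin both lock-sharing threads to cores 0 and 1 (assumed to share an L2).
    SetThreadAffinityMask(t[0], DWORD_PTR(1) << 0);
    SetThreadAffinityMask(t[1], DWORD_PTR(1) << 1);

    for (int i = 0; i < 2; ++i)
        ResumeThread(t[i]);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    for (int i = 0; i < 2; ++i)
        CloseHandle(t[i]);
    return 0;
}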

jimdempseyatthecove
Honored Contributor III

Dmitriy, and all who follow this thread.

I am in the process of submitting a "paper" on this subject to the multi-core Blogs site, so peek in there over the next few days (I don't know how long the approval time will be). I have one addition to make to the paper (I will do that tomorrow). I will try to include a sample of how you can implement a passive mutex (there is probably a better name for it).

Jim
Dmitry_Vyukov
Valued Contributor I
Interesting results of ping-pong test on several machines (including 32-core Itanium2-based NUMA beast):
http://groups.google.com/group/lock-free/msg/f6693bef3ab1ef50
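
[Editor's note] For anyone who can't follow the link, a ping-pong test of that kind looks roughly like the C++11 sketch below (my own reconstruction, not the code from that post): two threads hand a single cache line back and forth and the round-trip time is averaged. Combine it with affinity pinning, as in the earlier sketch, to compare same-cache versus cross-cache transfer costs.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<int> ball{0};        // the cache line being bounced between cores
const int kRounds = 1000000;

void player(int me, int other) {
    for (int i = 0; i < kRounds; ++i) {
        while (ball.load(std::memory_order_acquire) != me)
            ;                                          // wait until the "ball" is ours
        ball.store(other, std::memory_order_release);  // pass it back
    }
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(player, 0, 1), b(player, 1, 0);
    a.join();
    b.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("%.1f ns per one-way cache-line transfer\n",
                double(ns) / (2.0 * kRounds));
    return 0;
}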
