Lockfree_mpmc and scalability ...

aminer10 · ‎05-28-2012

Hello all,

I have finally found why lockfree_mpmc doesn't scale...

you can get the the source code of lockfree_mpmc from:

http://pages.videotron.com/aminer/

So please follow with me..

If you take a look at lockfree_mpmc object pascal

source code you will read this on the push side:

---

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
then
begin
result:=false;
exit;
end;

result:=true;

newTemp:=LockedIncLong(temp);
lastTail:=newTemp-1;

setObject(lastTail,tm);

repeat
if CAS(tail,lasttail,newtemp)
then
begin
exit;
end;
asm pause end;
until false;
end;

---

When i have tested the push() side with 4 threads i have noticed that lockfree_mpmc
doesn't scale at all., in fact i have got a retrograde throughput, that means that
i got less throughput than on a single thread test.. and i have finally found
why lockfree_mpmc doesn't scale. When you are using a lockfree_mpmc
on a single thread test the CAS does read and update the variables on the
level 1 cache, and it's fast, but when you are using 4 threads it does get
too slow cause we are reading and updating from the L2 and from the memory.

I have thried to play with the affinity mask and i have found that when i am
using two threads on my tests and reading and updating from the same level 2 cache
it does scale a little bit more and i have got more throughput with two threads

on different cores and on the same level 2 cache than the single threadtest.

I have alsomodified lockfree_mpmc to not touch the CAS and
the cache when tail and lasttail are not equal by using the following code inside
the repeat until loop:

if tail <> lasttail
then
begin
continue;
end;

and it does give better performance with this method

here is the final code of the push() side of lockfree_mpmc..

i think i will modify the pop() side like that...

---
function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
then
begin
result:=false;
exit;
end;

result:=true;

newTemp:=LockedIncLong(temp);
lastTail:=newTemp-1;

setObject(lastTail,tm);

repeat

if tail <> lasttail
then
begin
continue;
end;

if CAS(tail,lasttail,newtemp)
then
begin
exit;
end;
asm pause end;
until false;
end;
---

But as i have said before lockfree_mpmc doesn't scale when we are

using different cores and WE ARE NOT sharing the same cache,

that means that on my Intel Core 2 Quad Q6600 it does scale only

when we are using 2 threads on different cores that shares the same

level2 cache.

Thank you.

Amine Moulay Ramdane.

aminer10 · ‎05-28-2012

Hello,

I have only tested lockfree_mpmc ona Intel Core 2 Quad Q6600,
i don't have here an L3 cache, but perhaps lockfree_mpmc
will scale on an x86 that have an L3 cache.

I didn't tested it with an L3 cache, can you please do it for me
if you have an L3 cache and a quad core or morecore on your x86 computer ?

Just download the Lockfree MPMC and SPMC fifo queues version 1.12
from http://pages.videotron.com/aminer/and look inside the zip
file i have put a push.pas and a pop.pas tests, just open for example the
push.pas test and test it with a single threadsandafter that with 4 threads
bygiving the variable a the value of 1 andafter that 4 and after that just
email me the throughput for 1 and 4 threads.

Thank you.

Amine Moulay Ramdane.

aminer10 · ‎05-28-2012

Hello,

Here it is, i have compiled the push.pas to
push1.exe (one thread) and push4.exe (four threads)

Here they are:

http://pages.videotron.com/aminer/push1.exe

http://pages.videotron.com/aminer/push4.exe

If your computer is an x86 and you have an L3 cache
and your computer have 4 or more cores , can please
run those two programs and give me there output...

Thank you.

Amine Moulay Ramdane.

aminer10 · ‎05-29-2012

Hello,

I have receaived the benchmarks from some persons
that have an L3 cache, and i have noticed that lockfree_mpmc
doesn't scale either on with an L3 cache.
Do you know why this lock free fifo doesn't scale, cause
look at the following code on the push() side:

--

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
then
begin
result:=false;
exit;
end;
result:=true;
newTemp:=LockedIncLong(temp);

lastTail:=newTemp-1;
setObject(lastTail,tm);

repeat

if CAS(tail,lasttail,newtemp)
then
begin
exit;
end;
asm pause end;

until false;

end;

---

You have two thinks:

[1] newTemp:=LockedIncLong(temp);

[2] CAS(tail,lasttail,newtemp)

In the 4 threads scenario , as you can see
in [1] temp has to be loaded from the L3 cache
of the other cores on computers that have an L3 cache
but on my also from memory on my Intel Core 2 Quad Q6600
that doesn't have an L2 cache(just an L2 cache for every two cores) ,
so that will make the the four thread test with an L3 cache a little bit
slower than the single thread version and much slower without an

L3 cache compared to the single thread version that loads the values

from the L1 cache. That's the same for [2] , tail has to be loaded the same

way.

It's whyi am getting a retrograde throughput on my
Intel Core 2 Quad Q6600 and alomost the same thoughput
as the single thread on a computer with an L3 cache.

In the two thread scenario, you have to do a load
from the local L2 cache in [1] and [2] and this loads makes
the S part of the Amadahl equation much bigger than
the P part, it's why the two threads version doens't scale
either.

So in general i think it's not possible to make lockfree
fifo queues to scale on x86 when the lockfree code is sharing variables
between the cores, cause sharing variables is so expensive..

Thank you.

Amine Moulay Ramdane.

aminer10 · ‎05-29-2012

Hello,

Of course i was speaking about the x86 architecture...

Amine Moulay Ramdane.

aminer10 · ‎05-29-2012

I have corrected some typos , please read again...

Hello,

I have received the benchmarks from some persons
that have an L3 cache, and i have noticed that lockfree_mpmc
doesn't scale either on with an L3 cache.
Do you know why this lock free fifo doesn't scale, cause
look at the following code on the push() side:

--

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
then
begin
result:=false;
exit;
end;
result:=true;
newTemp:=LockedIncLong(temp);

lastTail:=newTemp-1;
setObject(lastTail,tm);

repeat

if CAS(tail,lasttail,newtemp)
then
begin
exit;
end;
asm pause end;

until false;

end;

---

You have two thinks:

[1] newTemp:=LockedIncLong(temp);

[2] CAS(tail,lasttail,newtemp)

In the 4 threads scenario , as you can see in [1] temp has to be

loaded from the L3 cache on computers that have an L3 cache ,

but on my Intel Core 2 Quad Q6600that doesn't have an

L3 cache(just an L2 cache for every two cores) i think it has to

be loaded from memory, so that will make the four thread test

with an L3 cache a little bit slower than the single thread version

that loads the values from the L1 cache and much slower on a computer

without an L3 cache. That's the same for [2] , tail has to be loaded the

same way.

It's why i am getting a retrograde throughput with four threads

on my Intel Core 2 Quad Q6600 and almost the same thoughput

as the single thread on a computer with an L3 cache.

In the two thread scenario, you have to do a load
from the local L2 cache in [1] and [2] and this loads makes
the S part of the Amadahl equation much bigger than
the P part, it's why the two threads version doens't scale
either.

So in general i think it's not possible to make lockfree
fifo queues to scale when the lockfree code is sharing variables
between the cores, cause sharing variables is so expensive..

Thank you.

Amine Moulay Ramdane.

aminer10 · ‎05-29-2012

I wrote:

>In the two thread scenario, you have to do a load
> from the local L2 cache in [1] and [2] and this loads makes
> the S part of the Amadahl equation much bigger than
> the P part, it's why the two threads version doens't scale
> either.
>
> So in general i think it's not possible to make lockfree
> fifo queues to scale when the lockfree code is sharing variables
> between the cores, cause sharing variables is so expensive..

I mean the CAS and the sharing of the variablesmakethe S part of

lockfree_mpmc much bigger than the S part andfrom the Amadahl equestion

this makeslockfree_mpmcnot scalable.

So in general i think it's not possible to make lockfree
fifo queues to scale when the lockfree code is sharing variables
between the cores and you are using CASes, cause sharing variables

and using CAS are so expensive..

Thank you.

Amine Moulay Ramdane.

aminer10 · ‎05-29-2012

Hello,

Even the following code inside push() method:

if getlength >= fsize
then
begin
result:=false;
exit;
end;

If you hav e noticedgetlength() method is sharing
variables between the cores and making the 4 threads
test much slower than the single thread test on
Intel Core 2 Quad Q6600, and i have tested it on my computer,

it makes it much slower cause the cache to cache tranfer is costly

on Intel Core 2 Quad Q6600, but on new architechtures that have

and L3 cache and hypertransport, it gives the same throughput

onsingle thread and four threads but it doesn't scale with four threads.
.
So the following parts are sharing variables between the cores and making
the 4 threads test much slower than the single thread test.on Intel Core 2 Quad Q6600:

[1] getlength()

[1] newTemp:=LockedIncLong(temp); ...

[2] and CAS(tail,lasttail,newtemp) also...

It's why i have told you that the CAS and sharing variables
are expensive.

Thank you.

Amine Moulay Ramdane.