Lockfree_mpmc and lockfree ParallelQueue ... - Page 2

aminer10 · ‎10-20-2010

Hello all,

As you will notice, even inside NASA and other institutions
in France (Ariane...) are making errors in there programs...

Read this:

"Although unexpected computer resets and human errors
continue to frustrate NASA/JPL scientists, both Pathfinder
and Sojourner... Current priorities involve fixing
the software bug that is causing Pathfinder to reset its
computer"

"The trouble experienced by the Mars lander "Mars Pathfinder"
is a classic example of problems caused by priority inversion
in realtime systems"

And don't forget about the french Ariane 5 rocket explosion:

http://www.youtube.com/watch?v=z-r9cYp3tTE

So, i have just updated and corrected a problem with
lockfree_mpmc, if you take a look at lockfree_mpmc.pas
inside my lockfree Parallel Queue zip file:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

You will notice that i have modified the algorithm in the push()
side, and i have used a test like this:

if getlength >= fsize
then
begin
result:=false;
exit;
end;

Now i have tested my new algorithm and it is working very well.

Correctness:

To not make the correctness verification longer, i will concentrate on
the more important parts. If you take a look at the lock-free ParallelQueue
source code , inside lockfree_mpmc.pas you will read this in push side:

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;//stdcall;
var lasttail,newtemp:longword;
begin
if getlength >= fsize
then
begin
result:=false;
exit;
end;
result:=true;
//newTemp:=windows.interlockedincrement(temp);
newTemp:=LockedIncLong(temp);

lastTail:=newTemp-1;
setObject(lastTail,tm);
repeat
if CAS(tail[0],lasttail,newtemp)
then
begin
exit;
end;
sleep(0);
until false;
end;

So, let's say the size of the bounded queue is 1000 , and imagine
that the threads are executing the "if getlength >= fsize " all at the
same time, and imagine that the getlength returns 999, so, the
"if getlength >= fsize" will returns false , and since we have
fSize:=(1 shl aPower) - margin , with a margin of 1000
(margin must be >= to the number of cores) , there will be no
problem(overflow..)...

Now in the pop side, you will read this:

function TLockfree_MPMC.pop(var obj:tNodeQueue):boolean;

var lastHead : longword;
begin
repeat
lastHead:=head[0];
if tail[0]<>head[0]
then
begin
obj:=getObject(lastHead);
if CAS(head[0],lasthead,lasthead+1)
then
begin
result:=true;
exit;
end;
end
else
begin
result:=false;
exit;
end;
until false;
end;

So, as you have noticed, there is a test like this:
if CAS(head[0],lasthead,lasthead+1) , and this test
avoids something like the ABA problem, cause if head
have changed , the CAS() will fail.

I have updated my Threadpool engine with the new lockfree_mpmc,
please download my other modules that use lockfree_mpmc.

Sincerely
Amine Moulay Ramdane.

Dmitry_Vyukov · ‎10-20-2010

Quoting aminer10

Dmitry wrote:
>And what if another thread will push margin items in between? Won't you still get overflow?

margin is limited to 1000 threads in the 1.22 version, it will work.

Imagine that fsize is 1000 and in the worst case we are at 999 items, if
all the 1000 threads have crossed there will still be no problem, no overflow.

The algorithm is correct now.

Do you want to say that each thread is allowed to enqueue at most 1 element per lifetime?

aminer10 · ‎10-20-2010

Dmitry wrote
>Do you want to say that each thread is allowed to enqueue at most 1 element per lifetime?

You can have 1000 threads (you can change that easily...),
and each thread can enqueue 1 element...

that's still cool Dmitry !

I didn't lose hope and finaly it's working ! :)

and thank you very much for the bug that you catched in getlength() !

And you didn't answered my other post titled "An important question..."...

Amine Moulay Ramdane.

aminer10 · ‎10-20-2010

Dmitry write
>Do you want to say that each thread is allowed to enqueue at most 1 element per lifetime?

I clarify ...

I mean that in the worst case situation if fsize is 1000 and there is 999 items,
even if all the threads have crossed before the CAS() in the push side
that increments head, his will still work, and there will be no overflow.

Do you understand now?

Amine Moulay Ramdane.

Dmitry_Vyukov · ‎10-20-2010

Quoting aminer10

And you didn't answered my other post titled "An important question..."...

I do not need it, I prefer to read code.

Dmitry_Vyukov · ‎10-20-2010

Quoting aminer10

I mean that in the worst case situation if fsize is 1000 and there is 999 items,
even if all the threads have crossed before the CAS() in the push side
that increments head, his will still work, and there will be no overflow.

Do you understand now?

And what if all threads have crossed before CAS 2 times? Or 1 thread has crossed before CAS 1000000 times?

aminer10 · ‎10-20-2010

Dmitry write
>Do you want to say that each thread is allowed to enqueue at most 1 element per lifetime?

I clarify ...

I mean that in the worst case situation if fsize is 1000 and there is 999 items,
even if all the threads have crossed before the CAS() in the push side
that increments head, this will still work, and there will be no overflow.

To simplify the reasonning about the correctness,i haveconcentrated
only on this worst case situation that i gave, i have excluded the other situations
like if other threads have dequeued in between, cause i know that they will
work...

it's this worst case situation that is critical and important and this worst
case situation does capture all the necessary elements that easy
the reasonning aboutthe correctness of the algoritm...

Do you understand now?

Amine Moulay Ramdane.

aminer10 · ‎10-20-2010

>And what if all threads have crossed before CAS 2 times?

this will work

>Or 1 thread has crossed before CAS 1000000 >times?

this will work...

it's the worst case situation that i gave that capture the necessary
elements... if fsize = 1000 and we have 999 items, the worst case
situation will be that all the 1000 threads have crossed before CAS,
and this will still work... think about it more Dmitry ...

This kind of reasonning that i used (using the worst case situation...)
SIMPLFIESthe reasonning about the correctness, like when we are
reasonning with INVARIANTS equations.

Sincerely,
Amine Moulay Ramdane.

aminer10 · ‎10-20-2010

I wrote:
>This kind of reasonning that i used (using the worst case situation...)
>SIMPLFIESthe reasonning about the correctness, like when we are ..

I mean simplifies = make my reasonningsimpler and much easier,
like when you are reasonning with place invariants in Petri Nets,
that make the reasonning much easier...

>reasonning with INVARIANTS equations.

Sincerely,
Amine Moulay Ramdane

Dmitry_Vyukov · ‎10-20-2010

Quoting aminer10

>And what if all threads have crossed before CAS 2 times?

this will work

Ok, I see. Threads CAN'T cross 2 times, because they will be blocked on the spin mutex (repeat if CAS(tail[0],lasttail,newtemp))

Dmitry_Vyukov · ‎10-20-2010

It looks like it works... if Pascal guarantees sequential consistency and atomicity for all memory accesses. Does it?

aminer10 · ‎10-20-2010

Correct Dmitry.

Amine.

aminer10 · ‎10-20-2010

I wrote:
>I mean that in the worst case situation if fsize is 1000 and there is 999 items,
>even if all the threads have crossed before the CAS() in the push side
>that increments head, this will still work, and there will be no overflow

Imean the push method that increments tail.

Amine.

aminer10 · ‎10-20-2010

Dmitry wrote:
>It looks like it works... if Pascal guarantees sequential consistency
>and atomicity for all memory accesses. Does it?

The variable inside an Object pascal object are aligned in 32 bits..

And i am using a CAS ..

is it not enough ?

What do you mean ?

Amine.

Dmitry_Vyukov · ‎10-20-2010

Quoting aminer10

The variable inside an Object pascal object are aligned in 32 bits..

And i am using a CAS ..

is it not enough ?

What do you mean ?

You are also use plain loads and stores.

Dmitry_Vyukov · ‎10-20-2010

Quoting Dmitriy Vyukov

Quoting aminer10

The variable inside an Object pascal object are aligned in 32 bits..

And i am using a CAS ..

is it not enough ?

What do you mean ?

You are also use plain loads and stores.

For example, in C/C++ plain loads and stores provide neither atomicity nor memory ordering.

aminer10 · ‎10-20-2010

I have a question Dmitry..

I have tried to benchmark a locked version against the lockfree version
and the lockfree version is faster by a large margin ...

Now my question:

When we execute something like a lock instruction in assembler
does the OScan deschedulethe thread even if it isexecuting
inside the lockassembler instruction inside the CAS ?

Amine.

aminer10 · ‎10-20-2010

Dmitry wrote:
>For example, in C/C++ plain loads and stores provide neither atomicity nor memory ordering

I will ask the compiler designer...

When the variables are aligned in 32 bit , are the loads and stored not atomic ?

>memory ordering

Is the CAS (in assembler) that i am usingnot sufficient?
is it necessaryto use the interlocked functions provided by the Windows OS ?

Amine.

aminer10 · ‎10-20-2010

>memory ordering

Is the CAS (in assembler) that i am usingnot sufficient?

Must i use a fence inside my CAS in assembler to force the ordering ?

Amine.

aminer10 · ‎10-20-2010

Hello,

List a lot of memory ordering guarantees in the Intel x86 memory model:

Loads are not reordered with other loads.

Stores are not reordered with other stores.

Stores are not reordered with older loads.

In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).

In a multiprocessor system, stores to the same location have a total order.

In a multiprocessor system, locked instructions have a total order.

Loads and stores are not reordered with locked instructions.

ButLoads may be reordered with older stores to different locations

Now i have looked at the lockfreE_mpmc source code and in the
push side there is no need for an Mfence.

But in the pop() sidethere is a need for an MfenceafterlastHead:=head[0];
and before if tail[0]<>head[0] , and i think there is a
need for an Mfenceafter the if tail[0]<>head[0] and before
the obj:=getObject(lastHead); and in the getlength() there a need for
an Mfence after tail1:=tail[0] and before if tail1 < head1 .

andlockcmpxchgdoes in factachieve Sequential consistency..

Am i correct or not, Dmitry?

Amine Moulay Ramdane.

aminer10 · ‎10-20-2010

Hello,

I correct ...

What is important for us is MEMORY VISIBILITY across threads..

So, the only variables that are used across threads are tail and head..

And since in a multiprocessor system, locked instructions have a total order.

Solockcmpxchgdoes in factachieve sequential consistency

and

LockedIncLong(temp) in the push sidealso achievesequential consistency

And tail and head are properly aligned to avoid false sharing...

So i think lockfree_mpmc algorithm is correct now.

Regards,
Amine Moulay Ramdane.

so i think the algorithm is correct.

Amine Moulay Ramdane

aminer10 · ‎10-20-2010

I wrote:
>
>I correct ...
>
>What is important for us is MEMORY VISIBILITY across threads..
>So, the only variables that are used across threads are tail and head..
>And since in a multiprocessor system, locked instructions have a total order.
>Sothe lockcmpxchg (inside the CAS)does in factachieve sequential consistency
>andLockedIncLong(temp) in the push sidealso achievesequential consistency
>And tail and head are properly aligned to avoid false sharing...
>So i think lockfree_mpmc algorithm is correct now.

I was speaking about sequential consistency in the x86 architecture of course..

Regards,
Amine Moulay Ramdane.