Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

A question about scalability ...

aminer10
Novice
1,683 Views


Hello,

I didn't know where to ask this question, so I decided to ask here..

Yesterday I read the following page about the USL
(Universal Law of Computational Scalability) of Dr. Gunther,
where he wrote this (see http://en.wikipedia.org/wiki/Neil_J._Gunther):

--------------------------

The relative capacity C(N) of a computational platform is given by:

C(N) = N / (1 + σ(N - 1) + κN(N - 1))

where N represents either the number of physical processors
in the hardware configuration or the number of users driving the
software application. The parameters σ and κ represent respectively
the levels of contention (e.g., queueing for shared resources) and
coherency delay (i.e., latency for data to become consistent) in the
system. The parameter κ also quantifies the retrograde throughput seen
in many stress tests but not accounted for in either Amdahl's law or
event-based simulations.
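In clean notation the law reads C(N) = N / (1 + σ(N - 1) + κN(N - 1)). A minimal Python sketch of it (the σ and κ values below are purely illustrative, not from any real measurement):

```python
def usl_capacity(n, sigma, kappa):
    """Universal Scalability Law: relative capacity C(N).

    sigma: contention coefficient (queueing for shared resources)
    kappa: coherency-delay coefficient (cost of keeping data consistent)
    """
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Illustrative values: with kappa > 0 the curve peaks and then retrogrades.
caps = [usl_capacity(n, sigma=0.05, kappa=0.002) for n in range(1, 65)]
peak = max(range(len(caps)), key=caps.__getitem__) + 1
print(peak)  # the N at which throughput peaks before retrograding
```

Note that with κ = 0 the formula reduces to Amdahl's law, which never retrogrades; the N(N - 1) coherency term is what makes throughput eventually fall.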

---------------

His website: http://www.perfdynamics.com/

If you read carefully, you will see that Dr. Gunther uses this
model to predict scalability after he simulates a relatively small
number of vusers in LoadRunner (because of licensing costs, it's more
cost-effective). He then finds the coefficients of the 2nd-degree
polynomial (quadratic equation) and transforms those coefficients
back to the USL parameters using σ = b - a and κ = a.

And then he extrapolates with the USL model to higher loads
to predict scalability.
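That coefficient mapping follows from the algebra: with x = N - 1 and y = N/C(N) - 1, the USL gives y = κx² + (σ + κ)x, so fitting y = ax² + bx yields κ = a and σ = b - a. A small round-trip sketch in Python (synthetic, noise-free points; the function name is mine):

```python
def fit_usl(ns, caps):
    """Map measured capacity points (N, C(N)) to USL parameters.

    Transform: x = N - 1, y = N/C(N) - 1, then least-squares fit
    y = a*x^2 + b*x (no constant term); kappa = a, sigma = b - a.
    """
    xs = [n - 1 for n in ns]
    ys = [n / c - 1 for n, c in zip(ns, caps)]
    # Normal equations for a 2-parameter fit through the origin.
    s4 = sum(x**4 for x in xs)
    s3 = sum(x**3 for x in xs)
    s2 = sum(x**2 for x in xs)
    t2 = sum(y * x**2 for x, y in zip(xs, ys))
    t1 = sum(y * x for x, y in zip(xs, ys))
    det = s4 * s2 - s3 * s3
    a = (t2 * s2 - t1 * s3) / det
    b = (s4 * t1 - s3 * t2) / det
    return b - a, a  # (sigma, kappa)

# Round-trip check on synthetic, noise-free points:
sigma, kappa = 0.08, 0.001
ns = [1, 2, 4, 8, 16]
caps = [n / (1 + sigma * (n - 1) + kappa * n * (n - 1)) for n in ns]
s, k = fit_usl(ns, caps)
print(round(s, 4), round(k, 4))  # recovers 0.08 and 0.001
```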

He also applies the model to webservers with heterogeneous
workloads, as in the following page:

http://perfdynamics.blogspot.com/2009/04/assessing-usl-scalability-wi...

Now my question follows:

Suppose we have obtained a small number of measured load-points
with LoadRunner or other tools, and we have calculated the USL equation
to predict the scalability of a webserver. How can the USL model tell us
whether the scalability/performance is limited by the network bandwidth
rather than by the server?

Can it give this information?

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

0 Kudos
42 Replies
aminer10

Hello again,


We have to calculate the AVERAGE waiting time on
the locked section under contention...

So i correct:


If T is the time spent in the locked section, and we have
a scenario of four threads, then under heavy contention
and in the worst case, if the threads are all waiting for the
locked section (locks, false sharing etc.), the first thread will not
wait, the second thread will wait T, the third thread will wait 2 * T
and the fourth thread will wait 3 * T...

And 1 + 2 + 3 + ... + n is an arithmetic series equal to:

(n * (n + 1)) / 2

and

Limit (n -> infinity) of (n * (n + 1) / 2) = infinity

So this mathematical term diverges when n -> infinity.

We have to replace n by (n - 1) in the arithmetic series,
since waiting thread number n on the locked section will
be waiting T * (n - 1).

The average waiting time will be (T * ((n - 1) * n) / 2) / n

So my scalability equation is:

Speed = 1 / (S * (((n - 1) * n / 2) / n) + (P / N))

S: serial part
P: parallel part
n: number of threads (or processes)
N: number of cores

and the Limit (n -> infinity) of ((n - 1) * n) / 2 = infinity,

so this term diverges.

And my equation clearly shows that under contention, when
there are many threads, we can have a retrograde throughput, as with
the lock-free RingBuffer and lock-free flqueue ...
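The average-wait derivation above can be checked numerically; here is a minimal Python sketch of the equation (the S, P and core-count values below are illustrative):

```python
def avg_wait_factor(n):
    """Average wait under worst-case convoying: thread i waits (i-1)*T,
    so the mean over n threads is ((n-1)*n/2)/n = (n-1)/2 lock times."""
    return ((n - 1) * n / 2) / n

def speed(n, ncores, S, P):
    """Scalability equation from the post:
    Speed = 1 / (S * avg_wait_factor(n) + P / N)."""
    return 1 / (S * avg_wait_factor(n) + P / ncores)

# Illustrative: 10% serial (locked) time, 90% parallel work, 8 cores.
# Throughput degrades as threads are added: the wait term grows with n.
print(speed(2, 8, 0.1, 0.9), speed(64, 8, 0.1, 0.9))
```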


Please look at:

http://pages.videotron.com/aminer/scalability.htm

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

aminer10


I wrote:

So my scalability equation is now:

Speed = 1 / (S * (((n - 1) * n / 2) / n) + (P / N))


S: serial part
P: parallel part
n: number of threads (or processes)
N: number of cores

and the Limit (n -> infinity) of ((n - 1) * n) / 2 = infinity,

so this term diverges.. and of course as n grows we will have
a retrograde throughput...


As you have noticed i am using the average waiting time:

T * (((n - 1) * n / 2) / n)

It's the average waiting time during which the application
is not running in parallel and there is contention
on the locked region.

And of course n (the number of threads or processes)
MUST not equal 0.


Welcome:


http://pages.videotron.com/aminer/scalability.htm


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/




aminer10


I wrote:
>And of course n (the number of threads or processes)
>MUST not equal 0

I mean n must not equal 1.

n must be >= 2


Regards,
Amine Moulay Ramdane.
Aubrey_W_
Hello Amine,

We really appreciate your contributions, but we ask that you try to combine your posts into a single forum thread as much as possible. It just makes it easier for users to navigate the front page of the forum. I've merged your two-part thread on capacity planning and scalability, and we're looking at merging some of your other threads.

But once again, thanks for your contributions!

==
Aubrey W.
Intel Software Network Support
Aubrey_W_
I merged two more threads into this one.

And the one below. Sorry, that one is out of sequence because I didn't notice it at first.

Best regards,

==
Aubrey W.
Intel Software Network Support
aminer10


Hello,

I wrote:

>So my scalability equation in a worst-case contention is:

>Speed = 1 / (S * (((n - 1) * n / 2) / n) + (P / N))
>S: serial part
>P: parallel part
>n: number of threads (or processes)
>N: number of cores


Now, if we want a more realistic general equation, we have
to add a contention factor to my equation. We will call it C, in
percentage (the percentage of contention that we have),
and we will add a factor n' = C * n,
so that my scalability equation becomes:

Speed = 1 / (S * (((n' - 1) * n' / 2) / n') + (P / N))


S: serial part
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage


C = average number of contended threads / n

and n' = C * n

And of course n (number of threads or processes) must be >= 2
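A small Python sketch of this contended version of the equation (the parameter values are illustrative only):

```python
def speed_with_contention(n, ncores, S, P, C):
    """Scalability with a contention factor C in (0, 1]:
    n' = C * n is the average number of contended threads, and
    Speed = 1 / (S * (((n' - 1) * n' / 2) / n') + P / N)."""
    n_prime = C * n
    wait = ((n_prime - 1) * n_prime / 2) / n_prime
    return 1 / (S * wait + P / ncores)

# Illustrative: lowering contention from 100% to 25% of threads
# raises the predicted speed at n = 32 threads on 8 cores.
print(speed_with_contention(32, 8, 0.05, 0.95, 1.0))
print(speed_with_contention(32, 8, 0.05, 0.95, 0.25))
```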


And my equation clearly shows that under contention, when
there are many threads, we can have a retrograde throughput such as the one
that we have in the lock-free RingBuffer and lock-free flqueue ...

Welcome:


http://pages.videotron.com/aminer/scalability.htm


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/


aminer10


I wrote:

>Now, if we want a more realistic general equation, we have
>to add a contention factor to my equation. We will call it C, in
>percentage (the percentage of contention that we have),
>and we will add a factor n' = C * n,
>so that my scalability equation becomes:
>
>
> Speed = 1 / (S * (((n' - 1) * n' / 2) / n') + (P / N))
>
> S: serial part
> P: parallel part
> n: number of threads (or processes)
> N: number of cores
> C: contention factor in percentage
>
> C = average number of contended threads / n
>
> and n' = C * n
>
> And of course n (number of threads or processes) must be >= 2


Now you can apply my mathematical model..


I wrote:

> C = average number of contended threads / n

Software companies must provide a mechanism/functions inside
the lock primitives (mutexes, critical sections etc., or the Intel
lock primitives...) that extracts and calculates this average
number of contended threads while our parallel application is
running. From this, you just plug those numbers into
my mathematical model and calculate the scalability.

Welcome: http://pages.videotron.com/aminer/scalability.htm


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/



aminer10

Hello,


My new scalability mathematical model:

Speed = 1 / (S * (((n' - 1) * n' / 2) / n') + (P / N))

S: serial part
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage

C = average number of contended threads / n

and n' = C * n

You can of course calculate S and P and
the average number of contended threads (or processes).

And you can also set the contention factor C equal,
for example, to 0.50 (50%, or you can simulate it as you wish)
and simulate and predict the scalability with my new mathematical
model by changing the following variables: N (number of cores)
and n (number of threads or processes) .. and of course you can
change the variable "average number of contended threads" as you
wish in your mathematical simulation using my new mathematical model.
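Such a simulation might look like this in Python: sweeping cores and threads at C = 0.5 (the S and P values are illustrative) shows speed rising with N but retrograding as n grows:

```python
def predicted_speed(n, ncores, S, P, C):
    """Speed = 1 / (S * (((n'-1) * n' / 2) / n') + P / N), with n' = C * n."""
    n_prime = C * n
    wait = (n_prime - 1) / 2  # ((n'-1) * n' / 2) / n' simplified
    return 1 / (S * wait + P / ncores)

# Speed improves with more cores but retrogrades with more threads
# (S = 0.02, P = 0.98, C = 0.5 are illustrative):
print([round(predicted_speed(8, N, 0.02, 0.98, 0.5), 2) for N in (2, 4, 8)])
print([round(predicted_speed(n, 8, 0.02, 0.98, 0.5), 2) for n in (4, 16, 64)])
```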

Welcome: http://pages.videotron.com/aminer/scalability.htm

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/



aminer10


Hello ,


If you take a look at my Parallel Compression Library


http://pages.videotron.com/aminer/

I have benchmarked Parallel Gzip (which works on TMemoryStreams and
TFileStreams), and it's giving a good performance, almost the same
performance as Pigz.exe ...

Now I want to optimize my Parallel Compression Library further...

If you take a look at my Parallel Compression Library in Object Pascal

take a look inside the zipfile: http://pages.videotron.com/aminer/

(look inside parallelcompression.pas)

It's using a method like this:


-----

procedure TCallbacks.GzipCompress(obj: Pointer);
var
  local_count: integer;
  local_stream: TMemoryStream;
begin
  try
    local_stream := TMemoryStream.create;

    GZCompressStream(Tjob(obj).stream, local_stream, Tjob(obj).compressionlevel);

    local_stream.position := 0;

    repeat
      if count_compress = Tjob(obj).index then
      begin
        Tjob(obj).fstream.CopyFrom(local_stream, local_stream.size);
        local_count := count_compress + 1;
        count_compress := local_count;
        break;
      end;
    until false;
  finally
    Tjob(obj).stream.free;
    local_stream.free;
    // Tjob(obj).free;
  end;

  if Tjob(obj).r > 0 then
  begin
    if local_count = Tjob(obj).number + 1 then
    begin
      Tjob(obj).event.setevent;
      Tjob(obj).free;
    end;
  end
  else
  begin
    if local_count = Tjob(obj).number then
    begin
      Tjob(obj).event.setevent;
      Tjob(obj).free;
    end;
  end;

end;

---

As you have noticed, I am calling:

Tjob(obj).fstream.CopyFrom(local_stream,local_stream.size);

So the worker threads of my Threadpool engine are also using the IO.

But we know that:

For threads that may wait for I/O to complete, we have to increase
the Threadpool pool size beyond the number of available cores,
because not all threads of my Parallel Gzip compression will be working
at all times (since they are using the IO and waiting for its completion).

So my question follows:

Must I use a profiling tool to estimate the ratio of waiting time (WT)
to service time (ST) for a typical request, calculate the ratio WT/ST,
and then estimate the number of Threadpool workers that I must
have, using the following equation:

Number of threads to keep the cores fully utilized = N * (1 + WT/ST)
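As a numeric illustration of that sizing rule (the helper name and the WT/ST values are mine):

```python
def pool_size(ncores, wait_time, service_time):
    """Thread-pool sizing rule from the post: N * (1 + WT/ST),
    where WT/ST is the ratio of I/O wait to CPU service time."""
    return round(ncores * (1 + wait_time / service_time))

# Illustrative: 8 cores, requests spend 3 ms waiting on I/O
# for every 1 ms of CPU service time.
print(pool_size(8, wait_time=3.0, service_time=1.0))  # -> 32
```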


Or must I use, for example, my TAsyncFileStream (asynchronous
file input/output), so as not to wait at all? You can find it inside
my LockfreeUtils; look inside the zip of LockfreeUtils:

http://pages.videotron.com/aminer/



Regards,
Amine Moulay Ramdane.


aminer10


Hello


I will clarify something about my scalability mathematical equation..


Now, if we want a more realistic general equation, we have
to add a contention factor to my equation. We will call it C, in
percentage (the percentage of contention that we have),
and we will add a factor n' = C * n.

So my scalability equation becomes:

Speed = 1 / (Soc + Sic * (((n' - 1) * n' / 2) / n') + (P / N))

Sic: serial part inside the critical regions (locked regions)
Soc: serial part outside the critical regions
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage

C = average number of contended threads / n
and n' = C * n

And of course n (number of threads or processes) must be >= 2

But to be more realistic we sometimes have to neglect the Soc part,
as is the case when we run 'serially' only the main thread - and there
are no serial parts other than the critical regions (locked regions) -
and after that we start our multiple threads .. I think the
serial part of the main thread, in this case, can be neglected,
so that my equation becomes:

Speed = 1 / (Sic * (((n' - 1) * n' / 2) / n') + (P / N))

Sic: serial part inside the critical regions (locked regions)
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage
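A minimal Python sketch comparing the equation with and without the Soc term (all parameter values are illustrative):

```python
def speed_full(n, ncores, soc, sic, P, C):
    """Speed = 1 / (Soc + Sic * (((n'-1) * n' / 2) / n') + P / N), n' = C * n.

    soc: serial fraction outside critical regions
    sic: serial fraction inside critical regions (under the lock)
    """
    n_prime = C * n
    wait = (n_prime - 1) / 2  # ((n'-1) * n' / 2) / n' simplified
    return 1 / (soc + sic * wait + P / ncores)

# Illustrative: a long-running server where Soc is negligible scales
# better than one paying a 20% serial fraction outside the locks.
print(round(speed_full(16, 8, 0.0, 0.05, 0.95, 0.5), 2))
print(round(speed_full(16, 8, 0.2, 0.05, 0.75, 0.5), 2))
```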

I wrote before:
> C = average number of contended threads / n

Software companies must provide a mechanism/functions inside
the lock primitives (mutexes, critical sections etc., or the Intel
lock primitives...) that extracts and calculates this average
number of contended threads while our parallel application is
running. From this, you just plug those numbers into
my mathematical model and calculate the scalability.

And you can also set the contention factor C equal,
for example, to 0.50 (50%, or you can simulate it as you wish)
and simulate and predict the scalability with my new mathematical
model by changing the following variables: N (number of cores)
and n (number of threads or processes) .. and of course you can
change the variable "average number of contended threads" as you
wish in your mathematical simulation using my new mathematical model.

And as you have noticed, I said explicitly that you can simulate my
mathematical model by varying N (number of cores), n (number of
threads) and the "average number of contended threads"; but implicitly,
since Soc, Sic and P are not varying, that means that you are keeping
the other configurations 'constant': for example the cores' frequency,
the same hard disk configuration etc.

And my equation clearly shows that under contention, when
there are many threads, we can have a retrograde throughput such as the one
that we have in the lock-free RingBuffer and lock-free flqueue ...


Please read here my article:

http://pages.videotron.com/aminer/scalability.htm


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/


aminer10


I wrote:

But to be more realistic, we have to neglect in many cases the Soc (serial part
outside the critical regions), as is the case when we run 'serially' only the main thread
- and there are no serial parts other than the critical regions (locked regions) -
and after that we start our multiple threads .. I think the serial part of
the main thread, in this case, can be neglected, so that my equation
becomes:

Speed = 1 / (Sic * (((n' - 1) * n' / 2) / n') + (P / N))

Sic: serial part inside the critical regions (locked regions)
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage


Please read here my article:

http://pages.videotron.com/aminer/scalability.htm


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/


aminer10


Hello again,


I will clarify more...


But to be more realistic, we have to neglect in many cases the Soc part,
as is the case when we run 'serially' only the main thread - and there
are no serial parts other than the critical regions (locked regions) -
and after that we start our multiple threads .. I think the serial part of
the main thread, in this case, can be neglected, so that my scalability equation
becomes:

Speed = 1 / (Sic * (((n' - 1) * n' / 2) / n') + (P / N))

Sic: serial part inside the critical regions (locked regions)
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage

To clarify this more:

My scalability equation can be viewed as a function of multiple
variables (even the time), like this:

Speed(t, N, n', Sic, Soc) = 1 / (Soc + Sic * (((n' - 1) * n' / 2) / n') + (P / N))

But if the Soc part (of the main thread) is big (let's say 50%),
and you run your program in a short time interval, let's say the running
time interval is t = 1 minute, the Soc part will have a big effect on my
scalability equation and cannot be neglected... But if you run the
same program for a long period of time, as is the case when we run
our server applications, the Soc part (of the main thread) will have
a small effect and can be neglected, so that my scalability equation
becomes:

Speed = 1 / (Sic * (((n' - 1) * n' / 2) / n') + (P / N))


Please read here my article:

http://pages.videotron.com/aminer/scalability.htm

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/



aminer10


Hello again,


My scalability equation can be viewed as a function of multiple
variables (even the time), like this:

Speed(t, N, n', Sic, Soc) = 1 / (Soc + Sic * (((n' - 1) * n' / 2) / n') + (P / N))

But if the Soc part (of the main thread) is big (let's say 50%),
and you run your program in a very short time interval, let's say
the running time interval is t = 10 seconds, the Soc part will have
a big effect on my scalability equation and cannot be neglected...
But if you run the same program for a long period of time, as is
the case when we run our server applications, the Soc part (of
the main thread) will have a small effect and can be neglected,
so that my scalability equation becomes:

Speed = 1 / (Sic * (((n' - 1) * n' / 2) / n') + (P / N))

The example above will look like this as a function of the time t...

|-- only the main thread is running (Soc part) -->
--> the multiple threads start -->
--> the multiple threads are running and the main thread is waiting -->


My article:

http://pages.videotron.com/aminer/scalability.htm


That's all.

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

aminer10


Hello,


We have to be smart here, so please follow with me...


My new scalability mathematical model says that:


http://pages.videotron.com/aminer/scalability.htm


[1]


Speed(t, N, n', Sic, Soc) = 1 / (Soc + Sic * (((n' - 1) * n' / 2) / n') + (P / N))


Sic: serial part inside the critical regions (locked regions)
Soc: serial part outside the critical regions
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage


C = average number of contended threads / n


and n' = C * n


And of course n (number of threads or processes) must be >= 2


So if you take a look at the benchmark graphs:


http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm


You will notice that my lock-free ParallelQueue scales better
than the lock-free RingBuffer and flqueue, because my ParallelQueue
is lock-free and keeps the locked region very small: just
a CAS(longword(tail[0]),longword(lasttail),longword(newtemp))
on the Push() side, and a CAS(head[0],lasthead,lasthead+1)
on the Pop() side, so it reduces the Sic part in my
scalability equation...

Also, I am using lock striping in my lock-free algorithm, which
means that I am REDUCING the Sic part in my scalability equation
even more. But since the Sic is very small yet still exists, my
scalability mathematical model predicts that when n (number of threads)
grows, n' = C * n will grow also; and since in equation [1] above
the factor ((n' - 1) * n' / 2) / n' = ((n' - 1) * n') / (2 * n') grows
much faster in the numerator ((n' - 1) * n') than in the denominator
(2 * n'), that means there will be a retrograde throughput when
n (number of threads) becomes much bigger, and this retrograde
throughput is reached more quickly with the lock-free flqueue and
lock-free RingBuffer than with my lock-free ParallelQueue.
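To illustrate the lock-striping idea in general terms (this is not the author's ParallelQueue, just a generic sketch of the technique in Python): operations hash to one of several independent locks, so threads touching different stripes never serialize on each other.

```python
import threading

class StripedCounterMap:
    """Lock striping: buckets guarded by several locks, chosen by hash,
    so threads touching different stripes never contend."""
    def __init__(self, nstripes=16):
        self.nstripes = nstripes
        self.locks = [threading.Lock() for _ in range(nstripes)]
        self.buckets = [dict() for _ in range(nstripes)]

    def _stripe(self, key):
        return hash(key) % self.nstripes

    def increment(self, key):
        i = self._stripe(key)
        with self.locks[i]:  # only this one stripe is serialized
            self.buckets[i][key] = self.buckets[i].get(key, 0) + 1

    def get(self, key):
        i = self._stripe(key)
        with self.locks[i]:
            return self.buckets[i].get(key, 0)

# Four threads hammering one key still count correctly,
# while other keys would mostly land on other stripes.
m = StripedCounterMap()
threads = [threading.Thread(target=lambda: [m.increment("hits") for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(m.get("hits"))  # -> 4000
```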


Welcome to my webpages:


http://pages.videotron.com/aminer/scalability.htm


and


http://pages.videotron.com/aminer/


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/


aminer10



Hello,

As you have noticed, I spoke in my last post about my new
scalability mathematical model:

Read "A new scalability mathematical model better than Amdahl's law"

http://pages.videotron.com/aminer/scalability.htm

My scalability equation can be viewed as a function of multiple
variables (even the time), like this:

[1]

Speed(t, N, n', Sic, Soc) = 1 / (Soc + Sic * (((n' - 1) * n' / 2) / n') + (P / N))


Sic: serial part inside the critical regions (locked regions)
Soc: serial part outside the critical regions
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage

C = average number of contended threads / n

and n' = C * n

And of course n (number of threads or processes) must be >= 2


But how in practice can we reduce the contention further?

There is logical contention: contention for access to data structures
that must be controlled to maintain consistency...

and

physical contention: contention for physical memory, buses, hard disks...


As you have noticed, I wrote in:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

"Because my lock-free ParallelQueue algorithm is using a hash
based method - and lock striping - to reduce the contention further,
and is using just a LockedInc(), I am REDUCING the duration
for which locks are held AND REDUCING the frequency with which
locks are requested, hence I am REDUCING A LOT the contention,
so it's very efficient."


Hence, I am reducing the 'logical' contention in my lock-free ParallelQueue...

But there is still physical contention that can occur when we are
using the memory, buses, hard disks...


Now how can we reduce this physical contention?


To reduce the Sic part in my scalability equation [1], you can
for example maximize locality and use more caches (of bigger size),
use hardware with a big L3 cache, and use for example
RAID 10 or ... with more cache memory and mirroring.

If this is not enough, you can for example scale out your website
by using other computer servers (and use mirroring) and load balance
between them..

I wrote in:

http://pages.videotron.com/aminer/efficiency_availibility.htm

"So, as you have noticed you can simulate our enterprise
webserver with this mathematical analytical equation and
validate the data collected by the fwptt stress webserver tool
and try to keep T (the average response time of your HTTP requests)
below or equal to 5 seconds, and see also if your webserver
is scaling; and we call this also Quality control using OR
(operational research)."

So, if you want to keep the response time T 'low' on enterprise websites,
you have to reduce the logical and physical contention as I explained
before...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

aminer10


Hello all,

I have thought more about my mathematical model

http://pages.videotron.com/aminer/scalability.htm


I have not taken into account a factor, and there is still a problem
with my mathematical model.

I explain:

When, in the worst case, all threads are waiting for the lock,
my equation works.

For physical contention, my mathematical model works, and
it does predict the retrograde throughput.

To model, for example, my lock-free ParallelQueue or the lock-free
RingBuffer or lock-free flqueue under contention, it works, and it
does predict the retrograde throughput.

But there is still a problem with my mathematical model:
if there is contention, and the scheduler schedules threads
other than the ones that are waiting on the lock, my mathematical
model doesn't work correctly.

But it was a good attempt on my part to try to find a general
scalability equation that also predicts the retrograde throughput.


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/


aminer10


Hello,

I was thinking more about those lock-free queues...


If you take a look at the lock-free flqueue:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

Look at the code of the pop():
---
function TLockfree_MPMC.pop(var obj:tNodeQueue):boolean;
var
  lastHead: longword;
begin
  repeat
    // (1)
    lastHead := head[0];
    if tail[0] <> head[0] then
    begin
      obj := getObject(integer(lastHead));
      // (2)
      if CAS(head[0], lasthead, lasthead + 1) then
      begin
        result := true;
        exit;
      end;
    end
    else
    begin
      result := false;
      exit;
    end;
  until false;
end;
---


So, if head has changed, the entire loop and the code
between (1) and (2) have to be re-executed, so the P (parallel) part
becomes larger, and this can become more expensive than
the scenario with only one thread; this is why there is a retrograde
throughput in the lock-free flqueue.
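That retry cost can be seen in a language-neutral sketch. Python has no user-level hardware CAS, so the sketch below emulates one with a lock (the class and function names are mine); the point is that every lost race repeats all the work between (1) and (2):

```python
import threading

class CasCell:
    """A single word supporting compare-and-swap, emulated with a lock
    (Python lacks a native user-level CAS instruction)."""
    def __init__(self, value=0):
        self._value, self._lock = value, threading.Lock()
    def load(self):
        return self._value
    def cas(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

# A pop()-style retry loop: redo the work between (1) and (2)
# every time another thread moves head first.
head = CasCell(0)
retries = 0
def pop_index():
    global retries
    while True:
        last = head.load()            # (1) read head
        if head.cas(last, last + 1):  # (2) try to claim the slot
            return last
        retries += 1                  # lost the race: repeat the work

print(pop_index(), pop_index())  # -> 0 1
```

Uncontended, the loop succeeds on the first try; under contention, the re-executed section grows the effective serial work per operation.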

But if you take a look at the push() inside the lock-free flqueue
source code, it looks like this:

---
function TLockfree_MPMC.push(tm: tNodeQueue): boolean; //stdcall;
var
  lasttail, newtemp: integer;
begin
  if getlength >= (fsize - margin) then
  begin
    result := false;
    exit;
  end;
  result := true;
  //newTemp := windows.interlockedincrement(temp);
  newTemp := LockedInc(temp);

  lastTail := newTemp - 1;
  setObject(lastTail, tm);
  repeat
    if CAS(longword(tail[0]), longword(lasttail), longword(newtemp)) then
    begin
      //{$IFDEF MUTIPLE_PRODUCER}
      // sleep(0);
      //{$ENDIF}
      exit;
    end;
    sleep(0);
  until false;
end;

---


So, as you have noticed, it doesn't look like the pop(): if lasttail
has changed, that doesn't mean that the thread will fail, retry
and repeat the transaction.. and also, the repeat-until loop here
is not as expensive as the loop on the pop() side.

This is why the lock-free flqueue is experiencing a retrograde
throughput on the pop() side, but not on the push() side (look at the graphs).

But with the lock-free RingBuffer there is a retrograde throughput in both
the pop() and the push().


With my lock-free ParallelQueue there is no retrograde throughput
in the graphs, because I am using lock striping and avoiding all those
problems; also, my ParallelQueue avoids contention, and I am
using it inside my Threadpool engine.


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer


gaston-hillar
Amine,

What is your question?

This is a forum and you seem to be wondering about an equation...
gaston-hillar
Amine,

I think that you should write a nice article about your new equations and your ideas. However, writing 20 times the same equation in this forum doesn't make sense. This behavior is usually known as SPAM... :(

aminer10

I wrote:
>So, as you have noticed, it doesn't look like the pop(): if lasttail
>has changed, that doesn't mean that the thread will fail, retry
>and repeat the transaction..


I mean: if tail has changed, that doesn't mean that the thread
will re-execute the loop...

If we have threads A, B, C, D running in this order, and the first
thread A executes the CAS, then B, C and D after that may
succeed and will not re-execute the loop ... and if they fail and
re-execute the loop, the loop is less expensive than on the pop() side.


But in the pop(), if we have threads A, B, C, D that are executing
in this order between (1) and (2), and the first thread A enters the CAS:

(1)
lastHead := head[0];
if tail[0] <> head[0] then
begin
  obj := getObject(integer(lastHead));
  (2)
  if CAS(head[0], lasthead, lasthead + 1) then ...


the whole loop has to be re-executed for B, C and D, and this loop is
more expensive than the repeat-until on the push() side; this
is why the lock-free flqueue is experiencing a retrograde throughput.


But with the lock-free RingBuffer there is a retrograde throughput in both
the pop() and the push().


With my lock-free ParallelQueue there is no retrograde throughput
in the graphs, because I am using lock striping and avoiding all those
problems; also, my ParallelQueue avoids contention, and I am
using it inside my Threadpool engine.


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer







aminer10


Hello,


We have to reason and be smart when we are dealing with
these kinds of lock-free algorithms.

Finally, and to be more clear, I will say that:

In the following graph:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm


the retrograde throughput that you will notice in the lock-free
flqueue is not caused by the factor 'contention'; it's
caused by the fact that when threads are executing
between (1) and (2) (on the pop() side):

(1)
lastHead := head[0];
if tail[0] <> head[0] then
begin
  obj := getObject(integer(lastHead));
  (2)
  if CAS(head[0], lasthead, lasthead + 1) then ...


if head has changed, all the threads have to re-execute
the loop, and this loop is more expensive than the
repeat-until on the push() side; this is why the lock-free
flqueue is experiencing a retrograde throughput.


But on the push() side:

repeat
  if CAS(longword(tail[0]), longword(lasttail), longword(newtemp)) then
  begin
    //{$IFDEF MUTIPLE_PRODUCER}
    // sleep(0);
    //{$ENDIF}
    exit;
  end;
  sleep(0);
until false;


The loop on the push() side is not as expensive, and when
the threads are executing on the push() side and tail
has changed, that doesn't mean that the threads have
to re-execute the loop; they may succeed, as I have
explained before.


Also, you have to take into account the factor that we call
CONTENTION: under medium and high contention both the lock-free
flqueue and the lock-free RingBuffer will behave badly, but
since my lock-free ParallelQueue is using lock striping,
it is well suited for medium to high contention.


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer

